Method and system for extending keyword searching to syntactically and semantically annotated data

ABSTRACT

Methods and systems for extending keyword searching techniques to syntactically and semantically annotated data are provided. Example embodiments provide a Syntactic Query Engine (“SQE”) that parses, indexes, and stores a data set as an enhanced document index with document terms as well as information pertaining to the grammatical roles of the terms and ontological and other semantic information. In one embodiment, the enhanced document index is a form of term-clause index that indexes terms and syntactic and semantic annotations at the clause level. The enhanced document index permits the use of a traditional keyword search engine to process relationship queries as well as to process standard document level keyword searches. In one embodiment, the SQE comprises a Query Processor, a Data Set Preprocessor, a Keyword Search Engine, a Data Set Indexer, an Enhanced Natural Language Parser (“ENLP”), a data set repository, and, in some embodiments, a user interface or an application programming interface.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under Contract No. DAAH01-00-C-R168, awarded by Defense Advanced Research Project Agency and Contract No. W74Z8H-04-P-0104, awarded by the Office of the Secretary of Defense, U.S. Army. The government has or may have certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method and system for searching for information in a data set, and, in particular, to enhanced methods and systems for syntactically indexing and performing syntactic searching of data sets using relationship queries to achieve greater search result accuracy.

2. Background

Often it is desirable to search large sets of data, such as collections of millions of documents, only some of which may pertain to the information being sought. In such instances it is difficult to either identify a subset of data to search or to search all data yet return only meaningful results. The techniques that have been traditionally applied to support searching large sets of data have fallen short of expectations, because they have not been able to achieve a high degree of accuracy of search results due to inherent limitations.

One common technique, implemented by traditional keyword search engines, matches words expected to be found in a set of documents through pattern matching techniques. Thus, the more that is known in advance about the documents, including their content, format, layout, etc., the better the search terms that can be provided to elicit a more accurate result. Data is searched and results are generated based on matching one or more words or terms that are designated as a query. Results such as documents are returned when they contain a word or term that matches all or a portion of one or more keywords that were submitted to the search engine as the query. Some keyword search engines additionally support the use of modifiers, operators, or a control language that specifies how the keywords should be combined when performing a search. For example, a query might specify a date filter to be used to filter the returned results. In many traditional keyword search engines, the results are returned ordered, based on the number of matches found within the data. For example, a keyword search against Internet websites typically returns a list of sites that contain one or more of the submitted keywords, with the sites with the most matches appearing at the top of the list. Accuracy of search results in these systems is thus presumed to be associated with frequency of occurrence.

One drawback to traditional keyword search engines is that they do not return data that fails to match the submitted keywords, even though the data may be relevant. For example, if a user is searching for information on what products a particular country imports, data that refers to the country as a “customer” instead of using the term “import” would be missed if the submitted query specifies “import” as one of the keywords, but doesn't specify the term “customer.” For example, a sentence such as “Argentina has been the main customer for Bolivia's natural gas” would be missed, because no forms of the word “import” are present in the sentence. Ideally, a user would be able to submit a query and receive back a set of results that were accurate based on the meaning of the query, not just on the specific keywords used in submitting the query.

Natural language parsing provides technology that attempts to understand and identify the syntactical structure of a language. Natural language parsers (“NLPs”) have been used to identify the parts of speech of each term in a submitted sentence to support the use of sentences as natural language queries against data. However, systems that have used NLPs to parse and process queries against data, even when the data is highly structured, suffer from severe performance problems and extensive storage requirements.

Natural language parsing techniques have also been applied to extracting and indexing information from large corpora of documents. By their nature, such systems are incredibly inefficient in that they require excessive storage and intensive computer processing power. The ultimate challenge with such systems has been to find solutions to reduce these inefficiencies in order to create viable consumer products. Several systems have taken an approach to reducing inefficiencies by subsetting the amount of information that is extracted and subsequently retained as structured data (that is, only extracting a portion of the available information). For example, NLPs have been used with Information Extraction engines that extract particular information from documents that follow predetermined grammar rules or when a predefined term or rule is recognized, hoping to capture and provide a structured view of potentially relevant information for the kind of searches that are expected on that particular corpus. Such systems typically identify text sentences in a document that follow a particular part-of-speech pattern or other patterns inherent in the document domain, such as “trigger” terms that are expected to appear when particular types of events are present. The trigger terms serve as “triggers” for detecting such events. Other systems may use other formulations for specified patterns to be recognized in the data set, such as predefined sets of events or other types of descriptions of events or relationships based upon predefined rules, templates, etc. that identify the information to be extracted. However, these techniques may fall short of being able to produce meaningful results when the documents do not follow the specified patterns or when the rules or templates are difficult to generate. The probability of a sentence falling into a class of predefined sentence templates or the probability of a phrase occurring literally is sometimes too low to produce the desired level of recall. Failure to account for semantic and syntactic variations across a data set, especially heterogeneous data sets, has led to inconsistent results in some situations.

BRIEF SUMMARY OF THE INVENTION

Embodiments of the present invention provide enhanced methods and systems for syntactically indexing and searching data sets to achieve more accurate search results with greater flexibility and efficiency than previously available. Techniques of the present invention provide enhanced indexing techniques that extend the use of traditional keyword searching techniques to relationship and event searching of data sets. In summary, the syntactic and/or semantic information that is gleaned from an enhanced natural language parsing process is stored in an enhanced document index, for example, a term-clause matrix, that is amenable to processing by the pattern (string) matching capabilities of keyword search engines. Traditional keyword search engines, including existing or even off-the-shelf search engines, can be utilized to discover information by pattern (or string) matching the terms of a relationship query, which are associated with syntactic and semantic information, against the syntactically and/or semantically annotated terms of sentence clauses (of documents) that are stored in the enhanced document index. In this manner, the relationship information of an entire corpus can be searched using a keyword search engine without needing to limit a priori the types or number of relationships that are stored.

Example embodiments of the present invention provide an enhanced Syntactic Query Engine (“SQE”) that parses, indexes, and stores a data set, as well as performs syntactic searching in response to queries subsequently submitted against the data set. In one embodiment, the SQE includes, among other components, a data set repository and an Enhanced Natural Language Parser (“ENLP”). The ENLP parses each object in the data set and transforms it into a canonical form that can be searched efficiently using techniques of the present invention. To perform this transformation, the ENLP determines the syntactic structure of the data by parsing (or decomposing) each data object into syntactic units, determines the grammatical roles and relationships of the syntactic units, associates recognized entity types and/or ontology paths if configured to do so, and represents these relationships in a normalized form. The normalized data are then stored and/or indexed as appropriate in an enhanced document index.

In one aspect, a corpus of documents is prepared for electronic searching by parsing each sentence into syntactic elements, normalizing the parsed structure to a plurality of tagged terms, each of which indicates an association between the term and a type of tag, and then transforming each sentence into a data structure that treats the tagged terms as additional terms of the sentence to be searched by a keyword search engine. In some embodiments, the tags include a grammatical role tag, a part-of-speech tag, an entity tag, an ontology path specification, or an action attribute. Other tags that indicate syntactic and semantic annotations are also supported. In some embodiments, linguistic normalization is performed to transform the sentence.

In another aspect, the SQE supports a syntax and a grammar for specifying relationship searches that can be carried out using keyword search engines. In one embodiment, the syntax supports a base component that specifies a syntactic search, a prepositional constraint component, a keyword (e.g., a document level keyword) constraint component, and a meta-data constraint component. One or more of the components may be optional. In another embodiment, the components are combined using directional operators that identify which query term has a desired grammatical role.

In yet another aspect, the SQE receives a query that specifies a relationship query using a term, tag type, or tag value. The SQE transforms the query into a set of Boolean expressions that are executed by a keyword search engine against the data structure that has been enhanced to include syntactic and/or semantic annotations. Indicators to matching objects, such as clauses, sentences, or documents, are returned. In one embodiment, the data structure comprises a term-clause index, a sentence index, and a document index.

In another aspect, the SQE performs corpus ingestion and executes queries using parallel processing. According to one embodiment, each query is performed in parallel on a plurality of partition indexes, which each include one or more portions of the entire enhanced document index.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a relationship query and the results returned by an example embodiment of the InFact® 2.5 search engine.

FIG. 2 is an example block diagram that conceptually represents a term-clause matrix that stores terms and enhanced indexing information for syntactic searching.

FIG. 3 is an example block diagram that conceptually represents a traditional term-document index.

FIG. 4 is an example block diagram of an example Syntactic Query Engine.

FIG. 5 is an overview of the steps performed by a Syntactic Query Engine to process data sets and relationship queries.

FIGS. 6A-6G are example screen displays that illustrate the general capabilities of the example user interface and the types of queries that can be executed by an example Syntactic Query Engine.

FIGS. 7A-7F are example display screens of the progression of an example RQL query submitted to a Syntactic Query Engine.

FIGS. 8A-8G are example screen displays of an interface associated with browsing ontology paths, viewing corpus metadata, and finding synonyms.

FIG. 9 is an example screen display of an interface associated with setting preferences for constraining relationship searches.

FIG. 10 is an example screen display of an interface associated with displaying SQE query history.

FIGS. 11A-11G are example screen displays from an alternate graphical based interface for displaying and discovering genetic relationships.

FIG. 12 is a conceptual block diagram of the components of an example embodiment of a Syntactic Query Engine.

FIG. 13 is a block diagram of the components of an Enhanced Natural Language Parser of an example embodiment of a Syntactic Query Engine.

FIG. 14 is a block diagram of the processing performed by an example Enhanced Natural Language Parser.

FIG. 15 is a block diagram illustrating a graphical representation of an example syntactic structure generated by the natural language parser component of an Enhanced Natural Language Parser.

FIG. 16 is a table that conceptually illustrates normalized data that has been annotated with syntactic and semantic tags by the postprocessor component of an Enhanced Natural Language Parser.

FIG. 17 is an example block diagram of data set processing performed by a Syntactic Query Engine.

FIG. 18 is a block diagram of query processing performed by a Syntactic Query Engine.

FIG. 19 is an example flow diagram of relationship query processing steps performed by an example Query Processor of a Syntactic Query Engine.

FIG. 20 is an example block diagram of a general purpose computer system for practicing embodiments of a Syntactic Query Engine.

FIG. 21 is an example block diagram of a distributed architecture for practicing embodiments of a Syntactic Query Engine.

FIG. 22 is a block diagram overview of parallel processing architecture that supports indexing a corpus of documents.

FIG. 23 is a block diagram overview of parallel processing architecture that supports relationship queries.

FIG. 24 is an example block diagram that shows parallel searching of an enhanced document index.

FIG. 25 is an example block diagram of an architecture of the partition indexes that supports incremental updates and data redundancy.

FIG. 26 is an example conceptual diagram of the transformation of a relationship search into component portions that are executed using a parallel architecture.

FIG. 27 is an example flow diagram of the steps performed by a build_file routine within the Data Set Preprocessor component of a Syntactic Query Engine.

FIG. 28 illustrates an example format of a tagged file built by the build_file routine of the Data Set Preprocessor component of a Syntactic Query Engine.

FIG. 29 is an example flow diagram of the steps performed by the dissect_file routine of the Data Set Preprocessor component of a Syntactic Query Engine.

FIG. 30 is an example conceptual block diagram of a sentence that has been indexed and stored in a term-clause index of a Syntactic Query Engine.

FIG. 31 is an example conceptual block diagram of sample contents of a document index of a Syntactic Query Engine.

DETAILED DESCRIPTION OF THE INVENTION

It is often desirable to search large sets of unstructured data, such as collections of millions of documents, only some of which may pertain to the information being sought. Traditional search engines typically approach such data mining by offering interactive searches that match the data to one or more keywords (terms) using classical pattern matching or string matching techniques. At the other extreme, information extraction engines typically approach the unstructured data mining problem by extracting subsets of the data, based upon formulations of predefined rules, and then converting the extracted data into structured data that can be more easily searched. Typically, the extracted structured data is stored in a relational database management system and accessed by database languages and tools. Other techniques, such as those offered by Insightful Corporation's InFact® products, offer greater accuracy and truer information discovery tools, because they employ generalized syntactic indexing with the ability to interactively search for relationships and events in the data, including latent relationships, across the entire data set and not just upon predetermined extracted data that follows particular syntactic patterns. InFact®'s syntactic indexing and relationship searching uses natural language parsing techniques to grammatically analyze sentences to attempt to understand the meaning of sentences and then applies queries in a manner that takes into account the grammatical information to locate relationships in the data that correspond to the query. Some of these embodiments support a natural language query interface, which parses natural language queries in much the same manner as the underlying data, in addition to a streamlined relationship and event searching interface that focuses on retrieving information associated with particular grammatical roles. Other interfaces for relationship and event searching can be generated using an application programming interface (“API”). Insightful's syntactic searching techniques are described in detail in U.S. Provisional Application Nos. 60/312,385 and 60/620,550, and U.S. application Ser. Nos. 10/007,299 and 10/371,399. The techniques described in these patent applications have typically employed the use of complex data bases with a proprietary search technology for performing relationship and event searching.

Embodiments of the present invention provide enhanced methods and systems for syntactically indexing and searching data sets to achieve more accurate search results with greater flexibility and efficiency than previously available. Techniques of the present invention provide enhanced indexing techniques that extend the use of traditional keyword search engines to relationship and event searching of data sets. In summary, the syntactic and semantic information that is gleaned from an enhanced natural language parsing process is stored in an enhanced document index, for example, a form of a term-clause matrix, that is amenable to processing by the more efficient pattern (string) matching capabilities of keyword search engines. Thus, traditional keyword search engines, including existing or even off-the-shelf search engines, can be utilized to discover information by pattern (or string) matching the terms of a relationship query, which are inherently associated with syntactic and semantic information, against the syntactically and semantically annotated terms of sentence clauses (of documents) stored in the enhanced document index. As another benefit, the additional capabilities of such search engines, such as the availability of Boolean operations and other filtering tools, are automatically extended to relationship and event searching.

Relationship and event searching, also described as “syntactic searching” in U.S. Application Nos. 60/312,385, 10/007,299, 10/371,399, and 60/620,550, supports the ability to search a corpus of documents (or other objects) for places, people, or things as they relate to other places, people, or things, for example, through actions or events. Such relationships can be inferred or derived from the corpus based upon one or more “roles” that each term occupies in a clause, sentence, paragraph, document, or corpus. These roles may comprise grammatical roles, such as “subject,” “object,” “modifier,” or “verb;” or, these roles may comprise other types of syntactic or semantic information such as an entity type of “location,” “date,” “organization,” or “person,” etc. The role of a specified term or phrase (e.g., subject, object, verb, place, person, thing, action, or event, etc.) is used as an approximation of the meaning and significance of that term in the context of the sentence (or clause). In this way, a relationship or syntactic search engine attempts to “understand” the sentence when a query is applied to the corpus by determining whether the terms in sentences of the corpus are associated with the roles specified in the corresponding query. For example, if a user of the search engine desires to determine all events in which “Hillary Clinton” participated as a speaker, then the user might specify a relationship query that instructs a search engine to locate all sentences/documents in which “Hillary Clinton” is a source entity and “speak” is an action. In response, the syntactic search engine will determine and return indicators to all sentences/clauses in which “Hillary Clinton” has the role of a subject and in which some form of the word “speak” (e.g., speaking, spoke) or a similar word has the role of a verb.

For example, FIG. 1 shows a relationship query and the results returned by an example embodiment of the InFact® 2.5 search engine. In the InFact® 2.5 product, a user of the search engine can specify a search for a known “source” or “target” entity (or both) looking for actions or events that involve that entity. The user can also specify a second entity and look for actions or events that involve both the first and second entity. The user can specify a particular action or may specify a type of action or any action. An entity specified as a source entity typically refers to the corresponding term's role as a subject (or subject-related modifier) of a clause or sentence, whereas an entity specified as a target typically refers to the corresponding term's role as an object (or object-related modifier) of a clause or sentence. An action or event typically refers to a term's role as a verb, related verb, or verb-related modifier. Moreover, instead of a specific entity, the user can specify an entity type, which refers to a tag such as an item in a classification scheme such as a taxonomy. A user can also specify a known action or action type and look for one or more entities, or entity types, that are related through the specified action or action type. Many other types and combinations of relationship searches are possible and supported as described in the above-mentioned co-pending patent applications.

In the example user interface shown in FIG. 1, a value for the first known entity is specified in entity field 102, a value for a known action is specified in action field 105, and a value for the type of the second entity is specified in entity type field 107. The source field 103 and target field 104 indicate whether the first known entity is to be a source of the action or a recipient (target) of the action. The particular query displayed instructs the search engine to look for sentence clauses that describe any person that drives a jeep when the Find Relationships button 106 is pressed. The results are returned in result field 110, which is shown sorted by similarity to the query.

Example embodiments of the present invention provide an enhanced Syntactic Query Engine (“SQE”) that parses, indexes, and stores a data set, as well as performs syntactic searching in response to queries subsequently submitted against the data set. In one embodiment, the SQE includes, among other components, a data set repository and an Enhanced Natural Language Parser (“ENLP”). The ENLP parses each object in the data set (typically a document) and transforms it into a canonical form that can be searched efficiently using techniques of the present invention. To perform this transformation, the ENLP determines the syntactic structure of the data by parsing (or decomposing) each data object into syntactic units, determines the grammatical roles and relationships of the syntactic units, associates recognized entity types if configured to do so, and represents these relationships in a normalized form. The normalized data are then stored and/or indexed as appropriate.

In one set of example embodiments, which were described in U.S. Application Nos. 60/312,385, 60/620,550, 10/007,299, and 10/371,399, normalized data structures are generated by an enhanced natural language parser and are indexed and stored as relational data base tables. The SQE stores the grammatical relationships that exist between the syntactic units and uses a set of heuristics to determine which additional relationships to encode in the normalized data structure in order to yield greater accuracy in results subsequently returned in response to queries. For example, the SQE may generate relationship representations in the normalized data structure that correspond to more “standard” ways to relate terms, such as the relationship represented by the tuple (subject, verb, object), but may also generate relationships that treat terms with certain grammatical roles in a non-standard fashion, such as generating a relationship representation that treats a term that is a modifier of the subject as the subject of the sentence itself. This allows the SQE to search for a user specified entity (as a subject) even in sentences that contain the specified entity as a modifier instead of as the subject of the sentence. For example, the clause:

    “the young boy bought a dog”

may be parsed and assigned the following grammatical roles:

    boy=subject
    young=modifier
    bought=verb
    dog=object

A relationship representation that corresponds to (boy, bought, dog), as well as a relationship representation that corresponds to (young, bought, dog), may be generated and stored by the SQE. Once the relationship representations are generated, they are stored in a variety of relational data base tables to facilitate retrieval.
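
By way of illustration only, the generation of the “standard” and “non-standard” relationship representations described above might be sketched as follows. The function name `relationship_tuples` and the dictionary layout of a parsed clause are hypothetical and are not taken from the described embodiments; the sketch assumes that the parser has already assigned grammatical roles and has identified which modifiers attach to the subject.

```python
def relationship_tuples(clause):
    """Return (subject, verb, object) tuples for one parsed clause.

    `clause` is a hypothetical dict with the keys "subject", "verb",
    "object", and an optional "subject_modifiers" list. Each modifier
    of the subject is also promoted to the subject position, which is
    the non-standard treatment described in the text.
    """
    subjects = [clause["subject"]] + list(clause.get("subject_modifiers", []))
    return [(s, clause["verb"], clause["object"]) for s in subjects]


parsed = {
    "subject": "boy",
    "subject_modifiers": ["young"],
    "verb": "bought",
    "object": "dog",
}

for relationship in relationship_tuples(parsed):
    print(relationship)
```

For the example clause, the sketch yields the two representations named above, (boy, bought, dog) and (young, bought, dog).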

In the example embodiments of the SQE that are described herein, the normalized data, including the grammatical role and other tag information that can be used to discover relationships, are integrated into enhanced versions of document indexes that are typically used by traditional keyword search engines to index the terms of each document in a corpus. A traditional keyword search engine can then search the enhanced indexing information that is stored in these document indexes for matching relationships in the same way the search engine searches for keywords. That is, the search engine looks for pattern/string matches to terms associated with the desired tag information as specified (explicitly or implicitly) in a query. In one such example system, the SQE stores the relationship information that is extracted during the parsing and data object transformation process (the normalized data) in an annotated “term-clause matrix,” which stores the terms of each clause along with “tagged terms,” which include the syntactic and semantic information that embodies relationship information. Other example embodiments may provide different levels of organizing the enhanced indexing information, such as an annotated “term-sentence matrix” or an annotated “term-document matrix.” One skilled in the art will recognize that other variations of storage organization are possible, including that each matrix may be comprised of a plurality of other data structures or matrices.

FIG. 2 is an example block diagram that conceptually represents a term-clause matrix that stores terms and enhanced indexing information for syntactic searching. The term-clause matrix 201 is an inverted index of tagged terms. That is, the matrix is indexed by the terms of each clause of each sentence of each document and indicates which clauses contain which terms. For example, in FIG. 2, each row 202 is indexed by a (tagged) term, e.g., “. . . /COUNTRY/China_subj” 206, and each column, e.g., columns 203, 204, and 205, represents a clause and contains a value that represents the number of times (e.g., a word count) that the clause contains the indexed term. The diagram is conceptual in that it doesn't imply that what is represented is stored in the SQE precisely in that manner. Different implementations may store the term separate from its annotations, and the index may be stored as a plurality of data structures that together comprise the term-clause index. For example, terms that correspond to a particular grammatical role, for example, a “subject,” may be stored separately from terms that correspond to a different grammatical role, for example, an “object.”
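
As a minimal conceptual sketch, and not a description of how the SQE actually stores its index, the term-clause matrix can be modeled as an inverted index that maps each tagged term to the clauses that contain it, together with a per-clause count. The function name `build_term_clause_index`, the clause identifiers, and the placeholder root name `ROOT` are assumptions introduced for illustration.

```python
from collections import defaultdict


def build_term_clause_index(clauses):
    """Build an inverted index: tagged term -> {clause id: occurrence count}.

    `clauses` maps a (hypothetical) clause identifier to the list of
    tagged terms produced for that clause during ingestion.
    """
    index = defaultdict(dict)
    for clause_id, tagged_terms in clauses.items():
        for tagged_term in tagged_terms:
            counts = index[tagged_term]
            counts[clause_id] = counts.get(clause_id, 0) + 1
    return dict(index)


# "ROOT" stands in for the name of the ontology root node.
clauses = {
    "D1.S2.C1": ["ROOT/ENTITY/LOCATION/COUNTRY/China_subj", "ally_verb"],
    "D1.S3.C1": ["ROOT/ENTITY/LOCATION/COUNTRY/China_subj", "align_verb"],
}

index = build_term_clause_index(clauses)
for tagged_term, postings in index.items():
    print(tagged_term, postings)
```

Each key of the resulting dictionary plays the part of a row of the conceptual matrix, and each clause identifier plays the part of a column.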

For illustrative purposes, FIG. 2 shows a partial term-clause index that corresponds to the text of a given Document D1 that includes:

    The president of France visited the capital of China in 1948. From 1949 to 1960 China was in alliance with the Soviet Union, although this relationship was already under severe strain in the late 1950s. From 1972 China aligned itself with the US against perceived Soviet expansionism.

The portion shown corresponds to the second and third sentences of the text, which together contain three clauses. (The indexing of the first clause is not shown.) The rows 202 each contain a term from one of these clauses, tag information that has been associated with the term during the data object parsing and transformation phase, and an indication of whether the clause contains the term in the role that is indicated by the associated tag information. That is, the terms are annotated with syntactic (e.g., grammatical role) and semantic (e.g., entity/ontology tag) information. For example, the tagged term “(ontology root node)/ENTITY/LOCATION/COUNTRY/China_subj” 206 consists of the term from the associated text “China,” a grammatical role tag “subj” that indicates use of the term “China” as a subject, and an ontology path to an entity tag “COUNTRY,” that indicates that the term “China” is known to have an entity type of “COUNTRY” as determined from an ontology, database, dictionary, or similar structure associated with the SQE. The string “(ontology root node)” is a placeholder in the figure for the real indicator (e.g., name) of the root node of whatever ontology is being used. Also, depending upon the particular ontology being used, there may be a series of different nodes that contain the type “COUNTRY” (other than “ENTITY/LOCATION”) and the SQE is programmed to take multiple nodes into account, when ingesting the documents and when searching for terms/tags in a relationship query that may be ambiguously expressed. The tagged terms

    “(ontology root node)/ENTITY/LOCATION/COUNTRY/Soviet Union_obj” 207 and
    “(ontology root node)/ENTITY/LOCATION/COUNTRY/Soviet Union_prep” 208

associated with the same document term “Soviet Union” indicate that the term is present in the document in two different grammatical roles: the first clause contains the term as an object and the third clause contains the term as a complement of a prepositional phrase. Note also that several linguistic normalizations have been performed during the data object transformation process to the normalized data. For example, the tense of the verb “was” has been changed to “be” (passive to active) and the verb phrase “was in alliance” has been changed to the verb “ally” (verbalization).
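
The composition of a tagged term from a document term, a grammatical role tag, and an optional ontology path can be illustrated with the following sketch. The helper names `make_tagged_term` and `split_tagged_term` and the placeholder root name `ROOT` are hypothetical; the string layout simply mirrors the examples shown in FIG. 2, and the sketch assumes that the document term itself contains no “/” character.

```python
def make_tagged_term(term, role, ontology_path=None):
    """Join an optional ontology path, a term, and a role tag into one string."""
    tagged = f"{term}_{role}"
    return f"{ontology_path}/{tagged}" if ontology_path else tagged


def split_tagged_term(tagged_term):
    """Recover (ontology_path, term, role) from a tagged term string.

    Only the last path segment is split on "_", so underscores inside
    ontology node names (e.g., VERB_CHANGE) are left untouched.
    """
    path, _, last_segment = tagged_term.rpartition("/")
    term, _, role = last_segment.rpartition("_")
    return (path or None, term, role)


country_path = "ROOT/ENTITY/LOCATION/COUNTRY"  # ROOT is a placeholder name
print(make_tagged_term("China", "subj", country_path))
print(make_tagged_term("ally", "verb"))  # verb not categorized by the ontology
print(split_tagged_term("ROOT/ENTITY/LOCATION/COUNTRY/Soviet Union_prep"))
```

Because the annotations are folded into an ordinary string, a keyword search engine can treat each tagged term as just another term to be matched.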

Several additional aspects are also notable with respect to the conceptual term-clause index illustrated in FIG. 2. The index illustrates the use of custom-specified portions of an ontology. In this case, in order to add verb sense information for a set of verbs (i.e., group a set of verbs together), a “VERB” node that indicates different types of verb sense information has been added to the ontology. Additional ontology information could be configured by a system administrator, or, alternatively, a user interface for dynamically modifying the ontology could be provided. In the particular portion of the ontology shown, two verb senses “VERB_CHANGE” and “VERB_STATIVE” are present. When the SQE ingests a verb that has not been categorized by the ontology, the verb is simply added to the index without a semantic annotation, such as the verb “ally,” which has been indexed as “ally_verb.” The same is true for other terms that correspond to other parts of speech that have not been classified (yet) by the ontology. For example, the nouns “relationship,” “strain” and “expansionism” have been indexed with syntactic annotations for their respective grammatical roles, but do not have any associated semantic (ontology path) annotations. One skilled in the art will recognize that a variety of combinations could be represented in the term-clause index. Also note that the concepts of wildcard interpretation can be implemented in a variety of ways, including explicitly putting “generic” nodes that correspond to particular types of wildcards (e.g., entity wildcards, physical_object wildcards, verb wildcards, etc.) depending upon the nodes in the ontology.

The integration of the enhanced indexing information into traditional search engine type document indexes (for example, an inverted index) is what supports the use of standard keyword search techniques to find a new type of document information, that is, relationship information, easily and quickly. An end user, such as a researcher, can pose simple Boolean style queries to the SQE, yielding results that are based upon an approximation of the meaning of the indexed data objects. Because traditional search engines do not pay attention to the actual contents of the indexed information (they just perform string matching or pattern matching operations without regard to the meaning of the content), the SQE can store all kinds of relationship information in the indexed information and use a keyword search engine to quickly retrieve it.

The SQE processes each query by translating or transforming the query into component keyword searches that can be performed against the indexed data set using, for example, an “off-the-shelf” or existing keyword search engine. These searches are referred to herein for ease of description as keyword searches, keyword-style searches, or pattern matching or string matching searches, to emphasize their ability to match relationship information the same way search terms can be string- or pattern-matched against a data set using a keyword search engine. The SQE then combines the results from each keyword-style search into a cohesive whole that is presented to the user.

For example, suppose a researcher is attempting to discover something about China's relationships. In particular, suppose that the researcher would like to know China's attitude toward other countries. The researcher accordingly enters a relationship query to the SQE, for example,

-   -   China_subj AND *_verb AND COUNTRY_obj        (query 209) which instructs the SQE to find all clauses        (sentences and/or documents) in which China is a source entity        (used as a subject) along with any action (any verb) and a        second entity of entity type “COUNTRY” is the recipient of the        action. Note that the syntax of this query is a conceptual        example of a specification of a relationship query using the SQE        of the present invention. The SQE will automatically determine        that for this particular ontology the node “COUNTRY” is part of        a full ontology pathname of “(ontology root        node)/ENTITY/LOCATION/COUNTRY.” Many different language        specifications and user interfaces can be used to effectively        communicate this same instruction to the SQE, and one skilled in        the art will recognize that other alternatives are contemplated        for use with the SQE. (The query specification matches the way        the information is stored in the term-clause and other indexes.)        Using the example term-clause index shown in FIG. 2, the SQE        would respond with at least indicators to the second and third        sentences of the Document D1 as they both contain clauses with        the term “China” as the subject. Moreover, the results returned        indicate several different relationships, allowing the        researcher quickly to discover a lot about China's foreign        policy. For example, the following relationships would be        quickly discovered:    -   China (is) ally of the Soviet Union    -   China aligns itself with the United States        which upon first glance may appear contradictory. By further        drilling down to look at the returned clauses or sentences, the        researcher can quickly discover that China's alliance with the        Soviet Union ended in 1960.
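The way a conceptual query such as query 209 could be answered from a term-clause index can be sketched as follows. This is a minimal illustration, not the SQE implementation: the clause identifiers, the clause contents, the `ROOT` stand-in for the “(ontology root node)” placeholder, and the function names are all hypothetical, and the sketch scans the tagged terms directly, whereas a keyword search engine would rely on its own index structures.

```python
from collections import defaultdict

# Hypothetical clause contents, loosely modeled on Document D1 of FIG. 2.
# "ROOT" stands in for the "(ontology root node)" placeholder.
COUNTRY = "ROOT/ENTITY/LOCATION/COUNTRY/"
CLAUSES = {
    "D1.s2.c1": [COUNTRY + "China_subj", "ally_verb", COUNTRY + "Soviet Union_obj"],
    "D1.s2.c2": ["relationship_subj", "be_verb", "strain_prep"],
    "D1.s3.c1": [COUNTRY + "China_subj", "align_verb", COUNTRY + "US_obj",
                 COUNTRY + "Soviet Union_prep"],
}


def build_index(clauses):
    """Map each tagged term to the set of clause ids that contain it."""
    index = defaultdict(set)
    for clause_id, tagged_terms in clauses.items():
        for tagged in tagged_terms:
            index[tagged].add(clause_id)
    return index


def lookup(index, pattern):
    """Clause ids holding a tagged term that matches a `name_role` pattern.

    `name` may be `*` (any term), a term, or an ontology node such as COUNTRY.
    """
    name, role = pattern.rsplit("_", 1)
    hits = set()
    for tagged, clause_ids in index.items():
        path, tagged_role = tagged.rsplit("_", 1)
        if tagged_role == role and (name == "*" or name in path.split("/")):
            hits |= clause_ids
    return hits


def query_and(index, patterns):
    """Boolean AND: clauses that satisfy every pattern."""
    return set.intersection(*(lookup(index, p) for p in patterns))


index = build_index(CLAUSES)
matches = query_and(index, ["China_subj", "*_verb", "COUNTRY_obj"])
print(sorted(matches))
```

Under these assumed clause contents, the two clauses in which China is the subject and a country is the object are returned, while the clause about the strained relationship is not.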

In contrast to the term-clause index, the document index of a traditional keyword search engine system simply stores each term that is present in the document, along with an indication of the number of times the term appears in each document. FIG. 3 is an example block diagram that conceptually represents a traditional term-document index. The term-document index 301 includes rows indexed by the terms 302 of the document. Each column, for example columns 303-305, indicates the number of times the indexed term (in each row) appears in the corresponding document. In order to pose a query to find out the same information against this document index, the researcher needs to be much smarter about the content of the documents being searched, or, alternatively, willing to end up with a lot of potentially random information to search through. For example, the researcher could search for documents that contain “China” or documents that contain “China” and a list of alternative countries to look for. In any case, because much of the information concerning China's role in each document is lost when stored in this type of traditional document index, the results provided would tend to be less informative.
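For comparison, a traditional term-document index of the kind described above can be sketched in a few lines. The two document texts, the tokenization rule, and the function name are assumptions made only for illustration.

```python
import re
from collections import Counter

# Hypothetical documents; in D1 China is a subject, in D2 an object.
DOCS = {
    "D1": "From 1949 to 1960 China was in alliance with the Soviet Union.",
    "D2": "The Soviet Union criticized China.",
}


def term_document_index(docs):
    """Map each lowercased term to {document id: occurrence count}."""
    index = {}
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, n in counts.items():
            index.setdefault(term, {})[doc_id] = n
    return index


index = term_document_index(DOCS)
print(index["china"])
```

Both documents are recorded identically for the term “china,” so the index alone cannot distinguish the document in which China acts from the one in which China is acted upon.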

FIG. 4 is an example block diagram of an example Syntactic Query Engine. A document administrator 402 adds and removes data sets (for example, sets of documents), which are indexed and stored within a data set repository 404 of the SQE 401. When used with keyword style searching techniques, the data set repository 404 stores an enhanced document index as described above. In the example shown in FIG. 4, a subscriber 403 to a document service submits queries to the SQE 401, typically using a visual interface. The queries are then processed by the SQE 401 against the data sets indexed in the data set repository 404. The query results are then returned to the subscriber 403. In this example, the SQE 401 is shown implemented as part of a subscription document service, although one skilled in the art will recognize that the SQE may be made available in many other forms, including as a separate application/tool, integrated into other software or hardware, for example, cell phones, personal digital assistants (“PDA”), or handheld computers, or associated with other types of existing or yet to be defined services. Additionally, although the example embodiment is shown and described as processing data sets and queries that are in the English language, one skilled in the art will recognize that the SQE can be implemented to process data sets and queries in any language, or any combination of languages.

FIG. 5 is an overview of the steps performed by a Syntactic Query Engine to process data sets and relationship queries. Steps 501-505 address the indexing (also known as the ingestion) process, and steps 506-509 address the query process. Note that although much of the discussion herein focuses on ingestion of an entire data set prior to searching, the SQE also handles incremental document ingestion, which is described below with respect to an example embodiment of the SQE architecture. Also, the configuration process that permits an administrator to set up ontologies, dictionaries, sizing preferences for indexes and other configuration and processing parameters is not shown.

Specifically, in step 501, the SQE receives a data set, for example, a set of documents. The documents may be received electronically, scanned in, or communicated by any reasonable means. In step 502, the SQE preprocesses the data set to ensure a consistent data format. In step 503, the SQE parses the data set, identifying entity type tags and the syntax and grammatical roles of terms within the data set as appropriate to the configured parsing level. For the purpose of extending keyword searching to syntactically and semantically annotated data, parsing sufficient to determine at least the subject, object, and verb of each clause is desirable to perform syntactic searches in relationship queries. However, one skilled in the art will recognize that subsets of the capabilities of the SQE could be provided in trade for shorter corpus ingestion times if full syntactic searching is not desired. For example, as described in U.S. patent Publication No. 2003/0233224 (U.S. patent application Ser. No. 10/371,399), the parsing level may be configured using a range of parsing levels, from “deep” parsing to “shallow” parsing. Deep parsing decomposes a data object into syntactic and grammatical units using sophisticated syntactic and grammatical roles and heuristics. Shallow parsing decomposes a data object to recognize “attributes” of a portion or all of a data object (e.g., a sentence, clause, etc.), such as entity types specified by a default or custom ontology associated with the corpus or the SQE. In step 504, the SQE transforms each parsed clause (or sentence) into normalized data by applying various linguistic normalizations and transformations to map complex linguistic constructs into equivalent structures. Linguistic normalizations include lexical normalizations (e.g., synonyms), syntactic normalizations (e.g., verbalization), and semantic normalizations (e.g., reducing different sentence styles to a standard form). These heuristics and rules are applied when ingesting documents and are important to determining how well the stored sentences eventually will be “understood” by the system.

For example, the SQE may apply one or more of transformational grammar rules, lexical normalizations (e.g., normalizing synonyms, acronyms, hypernyms, and hyponyms to canonical or standard terms), semantic modeling of actions (e.g., verb similarity), anaphora resolution (e.g., noun and pronoun coreferencing resolution) and multivariate statistical modeling of semantic attributes. Multivariate statistical modeling of semantic attributes refers to applying the techniques used to determine similar verbs to other parts of speech, such as nouns and adjectives. These techniques as applied to verbs include such determinations as the frequency weight of the primary sense of the verb; the set of troponyms associated with this verb sense (other ways to perform this verb, e.g., “sweep,” “carry,” and “prevail” are all troponyms of the verb “win” because they express ways to win); the set of hypernyms associated with this verb sense (more generic classes of which this verb is a part, e.g., “win” is one way to “gain,” “get,” or “acquire”); and the set of entailments associated with this verb sense (other verbs that must be done before this verb sense can be done, e.g., “winning” entails “competing,” “trying,” “attempting,” “contending,” etc.). The ability to transform a term to alternatives so that similar actions and entities will also be searched for provides one important way to increase the ability of the SQE to “understand” a search query and retrieve more relevant results. Many transformational grammar rules also can be incorporated into the SQE. The transformational grammar rules may take many forms, including, for example, noun, pronoun, adjective, and adverb verbalization transformations. Verbalization rules convert the designated part of speech to a verb. For example, the clause “X is a producer of Tungsten” can be simplified to the clause “X produces Tungsten.” Another example transformation rule is to simplify a clause by changing it from passive to active voice. For example, the clause “the chart was created by Y” can be transformed to the clause “Y created the chart.”

In step 505, the SQE stores the parsed and transformed sentences in a data set repository. As described above, when the SQE is used with a keyword search engine, the normalized data is stored in (used to populate) an enhanced document index such as the term-clause matrix shown in FIG. 2. After storing the data set, the SQE can process relationship queries against the data set. In step 506, the SQE receives a relationship query, for example, through a user interface such as that shown in FIGS. 6A-6G below. Alternatively, one skilled in the art will recognize that the query may be transmitted through a function call, batch process, or translated from some other type of interface. In step 507, if necessary (depending upon the interface) the SQE preprocesses the received relationship query and transforms it into the relationship query language understood by the system. For example, if natural language queries are supported, then the natural language query is parsed into syntactic units with grammatical roles, and the relevant entity and action terms are transformed into the query language formulations understood by the SQE. In step 508, the SQE executes the received query against the data set stored in the data set repository. The SQE transforms the query internally into sub-queries as appropriate to the organization of the data in the indexes and executes a traditional keyword search engine (or its own version of keyword style searching) to process the query. In step 509, the SQE returns the results of the relationship query, for example, by displaying them through a user interface such as the summary information shown in FIG. 6B.

One skilled in the art will recognize that, although the techniques are described primarily with reference to text-based languages and collections of documents, similar techniques may be applied to any collection of terms, phrases, units, images, or other objects that can be represented in syntactical units and that follow a grammar that defines and assigns roles to the syntactical units, even if the data object may not traditionally be thought of in that fashion. Examples include written or spoken languages, for example, English or French, computer programming languages, graphical images, bitmaps, music, video data, and audio data. Sentences that comprise multiple words are only one example of a phrase or collection of terms that can be analyzed, indexed, and searched using the techniques described herein. One skilled in the art will recognize how to modify the structures and program flow exemplified herein to account for differences in types of data being indexed and retrieved. Essentially, the concepts and techniques described are applicable to any environment where keyword style searching is contemplated.

Also, although certain terms are used primarily herein, one skilled in the art will recognize that other terms could be used interchangeably to yield equivalent embodiments and examples. In addition, terms may have alternate spellings which may or may not be explicitly mentioned, and one skilled in the art will recognize that all such variations of terms are intended to be included. Also, when referring to various data, aspects, or elements in the alternative, the term “or” is used in its plain English sense, unless otherwise specified, to mean one or more of the listed alternatives. For example, the terms “matrix” and “index” are used interchangeably and are not meant to imply a particular storage implementation. Also, a document may be a single term, clause, sentence, or paragraph or a collection of one or more such objects.

For example, the term “query” is used herein to include any form of specifying a desired relationship query, including a specialized syntax for entering query information, a menu driven interface, a graphical interface, a natural language query, batch query processing, or any other input (including API function calls) that can be transformed into a Boolean expression of terms and annotated terms. Annotated terms are terms associated with syntactic or semantic tag information, and are equivalently referred to as “tagged terms.” Semantic tags include, for example, indicators to a particular node or path in an ontology or other classification hierarchy. “Entity tags” are examples of one type of semantic tag that points, for example, to a type of ENTITY node in an ontology. In addition, although the description is oriented towards parsing and maintaining information at the clause level, it is to be understood that the SQE is able to parse and maintain information in larger units, such as sentences, paragraphs, sections, chapters, documents, etc., and the routines and data structures are modified accordingly. Thus, for ease of description, the techniques are described as they are applied to a term-clause matrix. One skilled in the art will recognize that these techniques can be equivalently applied to a term-sentence matrix and a term-document matrix.

In the following description, numerous specific details are set forth, such as data formats and code sequences, etc., in order to provide a thorough understanding of the techniques of the methods and systems of the present invention. One skilled in the art will recognize, however, that the present invention also can be practiced without some of the specific details described herein, or with other specific details, such as changes with respect to the ordering of the code flow.

The Syntactic Query Engine is useful in a multitude of scenarios that require indexing, storage, and/or searching of data sets, especially large ones, because it yields results to queries that are more contextually accurate than those of other search engines. An extensive relationship query language (“RQL”) is supported by the SQE. The query language is designed to be used with any SQE implementation that is capable of retrieving relationship information from an indexed data set, regardless of whether the SQE uses a relational database implementation with a proprietary search engine or an enhanced document index that supports a keyword search engine. However, some of the operators may be more easily implemented in one environment versus the other, or may not be available in certain situations. One skilled in the art will recognize that variants of the query language are easily incorporated and that other symbols can be equivalently substituted for operators.

In general, the syntax for a relationship query specifies “entities” and “actions” that are linked via a series of “operators” with one or more constraints such as document level filters.

Entity: An Entity is a noun or noun phrase in the search query or result. It can be the source (initiator of an action), the target (receiver of an action), or the complement of a prepositional phrase. Entities can be multiple words. If they are quoted, the exact phrase is preferably matched by a phrase in a document being searched. Either double quotes or single quotes may be used; if double quotes are used, then synonyms of the quoted expression will not be included in a search. If single quotes are used, synonyms of the quoted expression will be included. Synonyms are typically specified as properties of an ontology related to the corpus or in a dictionary.

Source: The initiator of an action is referred to as the source. For example, in the query

-   -   [Country]>threaten>USA,        “Country” is the source. The query instructs a search for all        countries that threaten the US, but not all countries that the        US threatens.

Target: The receiver of an action is referred to as the target. For example, in the query

-   -   USA>investigates>[organization]        “organization” is the target of the action. The query instructs        a search for all political organizations that are the target of        an investigation, but not those that are initiating an        investigation.

Prepositional Complement: An action is often performed with a prepositional complement. For example, in the query

-   -   Maya>visit>grandmother PREP CONTAINS Tuesday        “Tuesday” is the prepositional complement of the sentence. The        query instructs a search for only visits that happened on        Tuesdays.

Action: All relationships are based on an action, or verb. For example, in the query

-   -   Maya>visit>grandmother        “visit” is the action.

Operators: The following example operators are supported:

-   -   Action directionality for events: <, >, < > (or alternatively ←, →, ⇄)
    -   Boolean: AND, OR, NOT. The default operation for omitted Boolean operators is OR. Booleans do not have to be uppercase.
    -   Prepositional constraint: PREP CONTAINS (upper or lowercase), or ‘^’
    -   Document keyword constraint: DOCUMENT CONTAINS (upper or lowercase), or ‘;’
    -   Metadata constraint: METADATA CONTAINS (upper or lowercase), or ‘#’
    -   Wildcards (not within quotes): *, ? (single and multi-character)
    -   Offset indicators: ˜
    -   Curly braces { } are used for indirect link searches, to search for entities that link other entities together
    -   Brackets [ ] are used to denote types, either an OntologyPath, or, if used with a verb, an ActionType.
        Parentheses can be used to nest portions of the query.

The general format for a relationship query comprises four components:

-   -   Syntactic query ^ Prep constraints; Document keyword constraints # Metadata constraints
        The syntactic query component is specified in the format Source Entity>Action>Target Entity. However, it is not necessary to specify all three components, nor do the directional arrows need to point to the right. For example,
    -   Bush<*
    -   Bush<*<*
    -   >*>Bush
        are all correct specifications of the entity “Bush” as it relates to other entities through any action, and there is no difference between the first two or the last two. Although both actions and entities can be represented by a wildcard, the position of the wildcard in the query determines what it represents. Entities preferably do not point to each other directly.
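How the position of a term relative to the directional arrows determines its role can be sketched with a small parser for the syntactic query component only. This is one reading of the format described above, under stated assumptions: the function name and the returned dictionary are hypothetical, a two-part query such as “Bush<*” is assumed to pair an entity with an action (since entities preferably do not point to each other directly), and bidirectional or mixed arrows are deliberately not handled.

```python
import re


def parse_syntactic(query):
    """Split a syntactic query component into source, action, and target.

    Sketch only: handles a single arrow direction and at most three parts.
    """
    arrows = set(re.findall(r"[<>]", query))
    if len(arrows) != 1:
        raise ValueError("this sketch expects exactly one arrow direction")
    arrow = arrows.pop()
    # Empty positions (e.g. the leading one in ">*>Bush") become None.
    parts = [part.strip() or None for part in query.split(arrow)]
    if len(parts) > 3:
        raise ValueError("expected at most Source, Action, Target")
    parts += [None] * (3 - len(parts))
    first, action, last = parts
    # With ">" the source is on the left; with "<" it is on the right.
    if arrow == ">":
        source, target = first, last
    else:
        source, target = last, first
    return {"source": source, "action": action, "target": target}


for q in ["Maya>visit>grandmother", "Bush<*", "Bush<*<*", ">*>Bush"]:
    print(q, parse_syntactic(q))
```

The parser should be applied before any constraint components are appended, because a metadata range such as “Date>April 2002” also contains an arrow character.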

In addition to the basic syntactic search component of the query, there are three optional components that can be added to filter results (constrain the search):

-   -   any prepositional constraints, to filter results by information found in a prepositional phrase;
    -   any document keyword constraints, to restrict search to documents that have certain keyword(s); (this causes a basic keyword search)
    -   any metadata constraints, to restrict search to documents tagged with specific metadata values or ranges of values.
        These clauses can be expressed in either a long or abbreviated format. In the long format, the clauses are separated by the self-explanatory terms “PREP CONTAINS”, “DOCUMENT CONTAINS” and “METADATA CONTAINS”. For example, broken up into several lines for easier reading, the relationship query:
    -   Bush>visit>[Country] AND NOT China
    -   PREP CONTAINS plane
    -   DOCUMENT CONTAINS “foreign service” OR diplomat
    -   METADATA CONTAINS Date>April 2002
        specifies a syntactic search for “visit” relationships between the entity “Bush” and any country except China. The relationship query is constrained by the prepositional constraint “plane”, meaning that the word plane must be included in a prepositional phrase within this relationship, indicating travel by plane. The query is further constrained by the document keywords/key phrases “foreign service” and “diplomat,” meaning that only relationships from documents containing these words should be returned. Finally, the search is constrained by a date range, and instructs the search engine to only search documents written after April 2002. (This assumes that date related metadata has been associated with the documents at time of data set ingestion.) Date and numeric metadata ranges are specified with “=”, “>”, “<”, “>=”, and “<=”. Put together, this query searches specifically for diplomatic trips that Bush took by plane since April 2002 to foreign countries with the exception of China.

Note that there are two expressions designated in the document filter above: “foreign service” and “diplomat.” When a document contains a keyword in adjective form, e.g., “diplomatic,” the document is included in the search results responsive to a query that designated the noun form. The SQE may be configured to automatically extract the stem of the word and search for other forms. Document level queries are also allowed by specifying a keyword or phrase (even without a syntactic search component). For example:

-   -   germany AND france AND england        will cause the SQE to search for all documents containing these        keywords.

Filter clauses (i.e., constraint components) can also be entered in a more abbreviated form, in which the terms “PREP CONTAINS”, “DOCUMENT CONTAINS”, and “METADATA CONTAINS” are replaced by a ‘^’, a ‘;’, and a ‘#’ character respectively, as in:

-   -   Syntactic query ^ Prep constraints; Document keyword constraints # Metadata constraints
        The example relationship query described above regarding diplomatic trips that Bush took by plane can be rewritten in abbreviated form as follows:
    -   Bush>visit>[Country] AND NOT China ^ plane; “foreign service” OR diplomat # Date>April 2002
        Also note that multiple Metadata constraints can be used with complete Boolean expressions and that Boolean expressions can be nested. For example, the query
    -   hamas>act>* METADATA CONTAINS Author=“Andrew Jackson” OR price=300
        and the query
    -   england AND NOT (aerospace OR airways)>abandon>*
        describe valid relationship queries.
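The correspondence between the long and abbreviated formats, and the split of a query into its four components, can be sketched as follows. The function names are hypothetical, the prepositional separator is assumed to be the caret character, and the sketch assumes that the separator characters appear in the documented order and never inside a quoted phrase.

```python
import re

# Long-form keywords and the abbreviated separators named in the text.
_LONG_TO_SHORT = [("PREP", "^"), ("DOCUMENT", ";"), ("METADATA", "#")]


def to_abbreviated(query):
    """Replace 'PREP/DOCUMENT/METADATA CONTAINS' (any case) by ^, ; and #."""
    for keyword, symbol in _LONG_TO_SHORT:
        pattern = rf"\s*\b{keyword}\s+CONTAINS\b\s*"
        query = re.sub(pattern, f" {symbol} ", query, flags=re.IGNORECASE)
    return query.strip()


def split_components(query):
    """Split an abbreviated query into its four components (None if absent)."""
    rest, _, metadata = query.partition("#")
    rest, _, document = rest.partition(";")
    syntactic, _, prep = rest.partition("^")
    parts = {"syntactic": syntactic, "prep": prep,
             "document": document, "metadata": metadata}
    return {name: text.strip() or None for name, text in parts.items()}


long_query = """Bush>visit>[Country] AND NOT China
PREP CONTAINS plane
DOCUMENT CONTAINS "foreign service" OR diplomat
METADATA CONTAINS Date>April 2002"""

print(split_components(to_abbreviated(long_query)))
```

A fuller implementation would tokenize quoted phrases first, so that a separator character inside a quoted phrase is not mistaken for the start of a constraint.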

RQL formulated queries can also be embedded within a scripting language to provide an ability to execute batch relationship queries, functions having multiple queries, and control flow statements. For example, it may be desirable to encode a query to be executed at certain times each day against a data set that is continually updated and incrementally ingested. One skilled in the art will recognize that many scripting languages could be defined to achieve control flow of multiple relationship queries, and that the scripting language could include conditional statements. Relationship queries formulated using RQL are submitted to the SQE for execution from a variety of interfaces. For example, a web-based interface, similar to that provided by default with the InFact® products, can be used to submit relationship queries. In addition, relationship queries can be submitted using a natural language interface to the SQE, which parses the natural language query into syntactic units that can be translated into an RQL formulated query and then executed. Alternatively, the SQE supports an API that allows the development of other code, such as other user interfaces, that can execute relationship queries by submitting RQL formulated query strings to the SQE. FIGS. 11A-11G described below exemplify one such interface that provides a more graphical use of relationship queries.

FIGS. 6A-6G, 7A-7F, and 8A-8G are example screen displays from an example embodiment of a user interface designed to provide relationship and event searching in accordance with the techniques of the present invention. These screen displays emphasize particular features of a query language that has been designed to take advantage of combining the attributes of keyword style searching with syntactic searching. Additional examples of this user interface, query language, and variants thereof are included in Appendices A and B, which are incorporated herein by reference in their entirety.

FIGS. 6A-6G are example screen displays that illustrate the general capabilities of the example user interface and the types of queries that can be executed by an example Syntactic Query Engine. FIG. 6A is an example initial screen display of a web-based interface for entering a relationship query to the SQE. There are five basic components of this example interface. Pressing the Search tab 6A03 displays (or generates) the page used to enter queries. The user enters an RQL formulated query into free text field 6A01. When ready, a search is initiated by pressing the Search button 6A02. Alternatively, users can enter RQL syntax using a “form” or template. The Show Query Generator link 6A08 navigates to this alternative interface to build an RQL formulated query. This interface is described further below with respect to Figure BF. Pressing the Corpus tab 6A04 displays a page used to browse available ontologies, find out more information for a particular ontology path, browse available metadata, and find synonyms that are configured in the system. These capabilities are described further below with respect to FIGS. 8A-8G. Pressing the Preferences tab 6A05 displays a page used to set search preferences. These capabilities are described further below with respect to FIG. 9. Pressing the History tab 6A06 displays a page that shows a history of prior relationship searches. The history page is described further below with respect to FIG. 10. Pressing the Help tab displays one or more web pages of tutorial information and assistance. An example help file is included as Appendix A.

FIG. 6B is an example screen display of the format for displaying results in response to a relationship query specified using the relationship query language. The query is entered in query input field 6B01, and in this case indicates a search for everything that China buys (“china>buy>*”). A summary of the results of the search is displayed in result area 6B00. Note that each “row”, for example row 6B02, represents a particular relationship that is discovered in the corpus. Instances of this relationship may be actually located in more than one sentence or document. Thus, the Action field indicates a count of the number of times the particular relationship occurs in the data currently being displayed and summarized. For example, the first row 6B02 indicates that at least 2 instances of China buying (U.S.) wheat exist in the corpus. In one embodiment, the data is “chunked” prior to display. Thus, when used with chunked data, the number of instances of a particular event/relationship is valid only for what is being displayed. Other embodiments that calculate the entire result prior to display may indicate the number of instances a relationship appears over the entire corpus.
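The grouping of individual matches into summary rows, each with an instance count, can be sketched as follows. The instance tuples, sentence identifiers, and row layout are hypothetical; the bracketed count simply mirrors the “action [n]” style of the example displays.

```python
from collections import Counter

# Hypothetical matches for "china>buy>*": (source, action, target, sentence id).
INSTANCES = [
    ("china", "buy", "wheat", "s1"),
    ("china", "buy", "wheat", "s7"),
    ("china", "buy", "aircraft", "s3"),
]


def summarize(instances):
    """Return one summary row per distinct relationship, most frequent first."""
    counts = Counter((s, a, t) for s, a, t, _sentence in instances)
    return [f"{s} | {a} [{n}] | {t}" for (s, a, t), n in counts.most_common()]


for row in summarize(INSTANCES):
    print(row)
```

When results are chunked, the same function applied to one chunk yields counts that are valid only for that chunk, which matches the caveat stated above.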

FIG. 6C is an example screen display of a more complex query that includes Boolean operators and a document level filter. The query specified in query input field 6C01 includes two Boolean operators in a Boolean expression, “suicide AND (attack OR bombing)” as part of the syntactic search specification and includes a document level filter. Specifically, the user has specified a relationship search that will assist the user to discover all suicide attacks that have killed people in Israel. The results are shown summarized in result area 6C00. Clicking on any one of the actions, for example, “kill [5]” labeled as action 6C02, will cause the SQE to display the five instances in the clauses/sentences/documents in which the corresponding relationship is found.

FIG. 6D is an example screen display of a link search using an entity type. The query specified in query input field 6D01 instructs the SQE to search for all people or named persons that link Bush and Thatcher. The results displayed in result area 6D00 show each third person that provides a link between Bush and Thatcher. That is, the third person has some relationship to Bush and has some (possibly separate) relationship to Thatcher. To discover the details of these relationships, the user navigates to one of the displayed links such as link 6D02, which indicates that Ronald Reagan is the person in common in the indicated (indirect) relationship.

FIG. 6E is an example screen display of a search that specifies an entity type and an action type. The query specified in query input field 6E01 instructs the SQE to search for all events in which the Pope took some action involving motion (e.g., driving) to some location. As can be seen in the results displayed in result area 6E00, a variety of actions, sorted by similarity using the sort button 6E02, are displayed. Note also that a nested search button 6E03 can be pressed to cause the next query to be applied to the results from the prior query. This supports an iterative discovery process where a user progressively narrows a search based upon relationship information received at each search level.

FIG. 6F is an example screen display of a search that specifies ontology paths in conjunction with a prepositional constraint. The query specified in query input field 6F01 instructs the SQE to search for all corporate acquisitions, specifically as they relate to the amount of money spent. The prepositional constraint specified by “^ money” indicates that some amount of money needs to be present in a prepositional phrase of each matching clause. For example, the results shown in result area 6F00 show a first relationship with a target entity 6F02 in which a sawmill was bought for $2.7 million. Similarly, the results show a second relationship where the prepositional phrase that included the money is associated with the action “buy” labeled 6F03.

The ontology path specified in the query, “[organization/name]” is defined by an ontology associated with the system. Ontologies are typically associated with a corpus at system configuration time, although one skilled in the art will recognize that they can be dynamically changed and the portions of the corpus that are affected by the change re-ingested. An ontology can be a default ontology associated with the SQE or a custom ontology generated for a specific corpus. Ontology paths are enclosed in brackets, as in [person] or [country]. If a bracketed term is found in a relationship query, the SQE searches the ontology[ies] for all paths matching the term. If there are multiple matches, all matches are included in the search and results are combined. For example, in a search query containing the type [person], the SQE will substitute [IF/Entity/Person] to indicate use of the default ontology provided with the system. If another path exists in a custom ontology such as “MyOntology/People/Person,” this path is also included in the query and the results are combined. Ontology paths can be browsed through an interface provided under the “Corpus” tab, as described further below with respect to FIGS. 8A-8G.
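
The expansion of a bracketed term into all matching ontology paths can be sketched as shown below. This is an illustrative sketch under stated assumptions: the function name `expand_ontology_term` is hypothetical, and the matching rule (case-insensitive comparison of the trailing path nodes) is an assumption, since the rule the SQE uses to decide that a path “matches” is not specified here.

```python
def expand_ontology_term(term, ontology_paths):
    """Return every ontology path whose trailing nodes match the bracketed term.

    Assumption: a path matches when its last node(s) equal the node(s) of
    the term, ignoring case.
    """
    wanted = term.strip("[]").lower().split("/")
    matches = []
    for path in ontology_paths:
        nodes = path.lower().split("/")
        if nodes[-len(wanted):] == wanted:
            matches.append(path)
    return matches

# Paths from the default ontology ("IF") and a custom ontology named in the text.
paths = [
    "IF/Entity/Person",
    "IF/Entity/Location/Country",
    "MyOntology/People/Person",
]

person_paths = expand_ontology_term("[person]", paths)
print(person_paths)
```

Each returned path would then be searched, and the results of the individual searches combined.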

FIG. 6G is an example screen display of the query generator interface. The form displayed in display area 6G00 is provided to assist a user with specifying the components of a relationship query without needing intimate knowledge of the RQL syntax. The fields are labeled accordingly to explain what the user can enter to create a proper RQL formulated query.

FIGS. 7A-7F are example display screens of the progression of an example RQL query submitted to a Syntactic Query Engine. In FIG. 7A, the user submits a query “s6 kinase < >*< >*” in query input field 7A01. When the user presses the Search button 7A02, the SQE displays results in chunked pages of relationship summary information as shown in FIG. 7B. Note that the results shown in FIG. 7B include relationships that have “s6 kinase” as a subject, e.g., row 7B03, and relationships that have “s6 kinase” as an object, e.g., row 7B04. By clicking on one of the displayed actions, for example the “abolish” action 7C01 in FIG. 7C, the user can navigate to the document (sentence or clause) that shows that relationship. FIG. 7D is an example screen display of a document that has been navigated to by selecting an action link in a displayed relationship summary. The highlighted portion (i.e., shown as boxed herein) of the document text 7D01 is the information that has been summarized in the search results displayed in FIG. 7C. FIG. 7E is an example screen display that illustrates how the user might then go back and modify the query based upon information gleaned while drilling down a particular search. In this case, based upon the actions retrieved in the highest level search, the user has decided to drill down and look at “s6 kinase” as it blocks or regulates some other entity. FIG. 7F is an example screen display that illustrates that the SQE retrieves relationships having similar verbs to the verb sense specified in the query. In this case, the verb “modulate” is searched for as a similar verb to the user specified verb “regulate.”

FIGS. 8A-8G are example screen displays of an interface associated with browsing ontology paths, viewing corpus metadata, and finding synonyms. FIG. 8A is an example screen display of navigation used to browse a default ontology path. When a user types a path specification into path input field 8A01 and presses the Find Ontology Paths button 8A02, the corresponding additional subpaths are displayed in area 8A03. The user can select the “Show Roots” link 8A04 to show the roots of other ontologies available for that particular corpus. Note that an ontology typically includes a hierarchical classification system (a taxonomy) as well as properties associated with the nodes of the ontology and a dictionary.

FIGS. 8B-8F are example screen displays from a different version of the user interface, and are provided herein to illustrate how different ontologies may be associated with a single corpus. In FIG. 8B, several links to root nodes 8B02 are displayed. The user can either select one of these nodes and begin browsing or type a specific path into path input field 8B01. In the example shown, the user selects the path “LocusLink” and browses a hierarchy (not shown) by selecting a next node on the path labeled “Gene”. The next ontology level below “Gene” is displayed in subpath area 8C03 of FIG. 8C. Note that according to this version of the interface, available metadata for the corpus is displayed in metadata display area 8C04. FIG. 8D is an example screen display of an interface used to search for synonyms. Synonyms for a word specified in input field 8D01 are displayed in synonym display area 8D02. Other interfaces may provide links or other user interface components for navigating to the metadata and synonym information. FIGS. 8E and 8F illustrate the behavior of the interface when the user inputs a specific entity classification into path input field 8E01. In this case, when the user types in the term “steroids,” the SQE responds by displaying indications 8F02 of all ontology paths that contain the entity type “steroids.”

FIG. 9 is an example screen display of an interface associated with setting preferences for constraining relationship searches. There are a number of preference settings associated with a given search that may be customized to constrain search results or improve result display. The following options are illustrated on the Preferences page, and one skilled in the art will recognize that other options can be provided:

-   Include negated actions: When this option is enabled, relationships matching both the positive and negative sense of a verb are displayed. If a user performed a search such as “Clinton>visit>Russia”, the sentence “Due to health reasons Clinton did not visit Russia.” would only be returned if this setting was set to true. By default, this option is disabled, and only positive actions are displayed.
-   Search modifiers along with entities: This option specifies whether modifiers should be searched along with sources and/or targets (as subjects and/or objects). For the example sentence “Bill visits beautiful, green pastures outside Seattle,” if this property is set to true, then a search such as “Bill>visit>Seattle” will return the relationship expressed by that sentence. If this property is false, then it will not, and only the query “Bill>visit>pasture” would still yield this result.
-   Display modifiers: In the sentence “Bill visits beautiful, green pastures outside Seattle,” “beautiful, green” is the prefix modifier for pastures, and “outside Seattle” is the postfix modifier. In a search like “Bill>visit>*”, with this property set to true, the SQE will display modifiers along with pastures in the target entity summary. If this property is set to false, only the word “pastures” will be displayed as the target in the tabular display.
-   Enforce strict bi-directionality: When doing searches with bi-directional arrows, such as “< >”, the search can be interpreted in two different ways. For example, with the search query “Clinton < > *< > Bush”, one might wish only to view results in which Bush did something to Clinton XOR Clinton did something to Bush. (XOR indicates an exclusive Boolean OR operation.) Enforcing strict bi-directionality provides this result. However, one might also wish to see instances in which Bush and Clinton both did something to some other target together. These additional results are displayed if strict bi-directionality is not enforced.
-   Search ontology path name as term: If a user includes an ontology path like “[city]” in a search query, then results with cities are returned. However, the word “city” is not an instance of an item in the ontology itself, and is not associated with the ontology path. Therefore, without setting this preference, one would not see results that contain the word “city.” This preference is set to true to include results with the term “city” in them as well as any terms defined by the ontology path “city.”
-   Number of relationships per page: The user can set the number of relationships to display on a single page of relationship results. The smaller this value, the faster results will be returned.
-   Number of documents per page: The user can set the number of documents to display on a single page of document results. The smaller this value, the faster results will be returned.
-   Sort scheme: This setting allows users to sort results in a given chunk or batch of results according to one of several sorting schemes, and to set the default sort scheme for all future searches. Note that an individual result set can also be sorted in the result display. If results are sorted using the drop-down selection box on the results page, the setting does not persist for subsequent searches.
-   Surrounding sentences to export: This option allows the user to vary how much contextual information from the document is included along with the sentences returned when the user exports a result set to HTML.
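
The difference between enforcing and not enforcing strict bi-directionality can be sketched as a filter over candidate relationships. This is a minimal sketch only; the representation of a relationship as a (sources, action, targets) triple and the function name `matches_bidirectional` are assumptions made for illustration and do not describe how the SQE evaluates such queries internally.

```python
def matches_bidirectional(rel, a, b, strict):
    """Decide whether a relationship satisfies a query of the form 'a < > * < > b'."""
    sources, _action, targets = rel
    a_to_b = a in sources and b in targets
    b_to_a = b in sources and a in targets
    if a_to_b != b_to_a:
        # Exactly one direction holds (exclusive OR): always a match.
        return True
    if strict:
        return False
    # Not strict: also accept a and b acting together on some other target.
    return a in sources and b in sources

# Hypothetical relationships extracted from a corpus.
relationships = [
    ({"clinton"}, "meet", {"bush"}),
    ({"bush"}, "criticize", {"clinton"}),
    ({"bush", "clinton"}, "visit", {"russia"}),
]

strict_hits = [r for r in relationships
               if matches_bidirectional(r, "clinton", "bush", strict=True)]
loose_hits = [r for r in relationships
              if matches_bidirectional(r, "clinton", "bush", strict=False)]
print(len(strict_hits), len(loose_hits))
```

With the hypothetical data above, the joint visit to Russia is returned only when strict bi-directionality is not enforced.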

FIG. 10 is an example screen display of an interface associated with displaying SQE query history. The history page displays a history queue 1000 of all searches performed in the current browser session. If the browser terminates, if the user switches to another browser, or if the user presses the Clear button 1010, the history queue 1000 is reset. Clicking on one of links 1001-1002 for any query in the Query column will navigate to the results page for that particular query. Clicking on one of the links 1003-1004 in the Documents column will navigate to the set of documents that contain the results of that query. The “Depends On” column 1005 indicates whether a given query depends on a previous query, for example as a result of executing a nested search.

FIGS. 11A-11G are example screen displays from an alternate graphical based interface for displaying and discovering genetic relationships. This interface could be generated, for example, using an API supported by the SQE. Appendix C describes an example API that is supported by an example embodiment of an SQE, and is herein incorporated by reference in its entirety. One skilled in the art will recognize that many different APIs can be provided to support accessing the functions of an SQE from other code. In FIG. 11A, the user can select possible files that correspond to various sets of genes that can be studied to discover relationships between them. In FIG. 11B, the user indicates a desire to select the entity list to be displayed. In FIG. 11C, the user selects the “genes3.txt” file as the entity file to be displayed. In FIG. 11D, the user (optionally) selects an action list file, for displaying specific types of relationships (based upon verbs). FIG. 11E shows the results of all the relationships between the genes indicated in the entity list. Each dot represents a different gene and each line between two genes represents a relationship evidenced by the corpus. Selecting two genes in the graphical user interface results in the specification of an RQL formulated query to the SQE. FIG. 11F illustrates the results of selecting two of the genes in order to display the specific relationships between them. In this case the user has selected the iqgap1 gene 11F03 and the q02248 gene 11F03 and the possible “actions” between them are displayed in relationship results area 11F01. In this case, the relationships include “interactions,” “regulation,” and “localization.” At this point, the user has gained information for further follow up. In FIG. 11G, two different genes (entities) 11G02 and 11G03 are selected to display relationships between them. The actions between them are displayed in relationship results area 11G01. Note that the relationship query invokes a search for both genes as source and target in this example.

An SQE as described may perform multiple functions (e.g., data set parsing, data set storage, query transformation and processing, and displaying results) and typically comprises a plurality of components. FIG. 12 is a conceptual block diagram of the components of an example embodiment of a Syntactic Query Engine. A Syntactic Query Engine 1201 comprises a Relationship Query Processor 1210, a Data Set Preprocessor 1203, a Data Set Indexer 1207, an Enhanced Natural Language Parser (“ENLP”) 1204, a data set repository 1208, and, in some embodiments, a user interface (or an Applications Programming Interface “API”) 1213. The Data Set Preprocessor 1203 converts received data sets 1202 to a format that the Enhanced Natural Language Parser 1204 recognizes. The Enhanced Natural Language Parser (“ENLP”) 1204 parses the preprocessed sentences, identifying the syntax and grammatical role of each meaningful term in the sentence and the ways in which the terms are related to one another and/or identifies designated entity and other ontology tag types and their associated values, and transforms the sentences into a canonical form (a normalized data representation). The Data Set Indexer 1207 indexes the normalized data into the enhanced document indexes and stores them in the data set repository 1208. The Relationship Query Processor 1210 receives relationship queries and transforms them into a format that the Keyword Search Engine 1211 recognizes and can execute. (Recall that the Keyword Search Engine 1211 may be an external or third-party keyword search engine that the SQE calls to execute queries.) The Keyword Search Engine 1211 generates and executes keyword searches (as Boolean expressions of keywords) against the data set that is indexed and stored in the data set repository 1208. The Keyword Search Engine 1211 returns the search results through the user interface/API 1213 to the requester as Query Results 1212.

In operation, the SQE 1201 receives as input a data set 1202 to be indexed and stored. The Data Set Preprocessor 1203 prepares the data set for parsing by assigning a Document ID to each document that is part of the received data set (and sentence and clause IDs as appropriate), performing OCR processing on any non-textual entities that are part of the received data set, and formatting each sentence according to the ENLP format requirements. The Enhanced Natural Language Parser (“ENLP”) 1204 parses the data set, identifying for each sentence, a set of terms, each term's tags, including potentially part of speech and associated grammatical role tags and any associated entity tags or ontology path information, and transforms this data into normalized data. The Data Set Indexer 1207 indexes and stores the normalized data output from the ENLP in the data set repository 1208. The data set repository 1208 represents whatever type of storage along with the techniques used to store the enhanced document indexes. For example, the indexes may be stored as sparse matrix data structures, flat files, etc. and reflect whatever format corresponds to the input format expected by the keyword search engine. After a data set is indexed, a Relationship Query 1209 may be submitted to the SQE 1201 for processing. The Relationship Query Processor 1210 prepares the query for parsing, for example by splitting the Relationship Query 1209 into sub-queries that are executable directly by the Keyword Search Engine 1211. As explained above, a Relationship Query 1209 is typically comprised of a syntactic search along with optional constraint expressions. Also, different system configuration parameters can be defined that influence and instruct the SQE to search using particular rules, for example, to include synonyms, related verbs, etc. Thus, the Relationship Query Processor 1210 is responsible for augmenting the specified Relationship Query 1209 in accordance with the current SQE configured parameters. To do so, the Relationship Query Processor 1210 may access the ontology information which may be stored in Data Set Repository 1208 or some other data repository. The Relationship Query Processor 1210 splits up the query into a set of Boolean expression searches that are executed by the Keyword Search Engine 1211 and causes the searches to be executed. The Relationship Query Processor 1210 then receives the result of each search from the Keyword Search Engine 1211 and combines them as indicated in the original Relationship Query 1209 (for example, using Boolean operators). One skilled in the art will recognize that the Relationship Query Processor 1210 may be comprised of multiple subcomponents that each execute a portion of the work required to preprocess and execute a relationship query and combine the results for presentation. The results (in portions or as required) are sent to the User Interface/API component 1213 to produce the overall Query Result 1212. The User Interface Component 1213 may interface to a user in a manner similar to that shown in the display screens of FIGS. 6A-6G and 7A-7F.
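
The division of labor between the Relationship Query Processor and the Keyword Search Engine can be sketched as follows. This is a toy sketch under stated assumptions: the annotated keyword format (“role:term”), the in-memory `INDEX`, the `RELATED_VERBS` table, and all function names are hypothetical stand-ins, and the keyword search engine is simulated by a dictionary lookup rather than an actual engine.

```python
# Toy term-clause index: annotated keyword -> set of clause IDs (hypothetical).
INDEX = {
    "subject:china": {1, 2, 5},
    "verb:buy": {1, 3},
    "verb:purchase": {2},
    "verb:sell": {5},
}

# Hypothetical configuration parameter: related verbs to include in a search.
RELATED_VERBS = {"buy": ["purchase"]}

def keyword_search(term):
    """Stand-in for the keyword search engine: look up one annotated keyword."""
    return INDEX.get(term, set())

def process_relationship_query(query):
    """Split 'source > verb > target' into keyword searches and combine them."""
    source, verb, target = [part.strip() for part in query.split(">")]
    result = None
    for role, value in (("subject", source), ("verb", verb), ("object", target)):
        if value == "*":
            continue  # a wildcard places no constraint on this role
        alternatives = [value]
        if role == "verb":
            # Augment the query according to the configured parameters.
            alternatives += RELATED_VERBS.get(value, [])
        # OR the alternatives for this role, then AND across roles.
        hits = set().union(*(keyword_search(f"{role}:{v}") for v in alternatives))
        result = hits if result is None else result & hits
    return result if result is not None else set()

clause_ids = process_relationship_query("china > buy > *")
print(sorted(clause_ids))
```

In this sketch each role contributes one Boolean OR expression, and the per-role results are intersected, which corresponds to combining the individual keyword search results using Boolean operators.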

FIG. 13 is a block diagram of the components of an Enhanced Natural Language Parser of an example embodiment of a Syntactic Query Engine. The Enhanced Natural Language Parser (“ENLP”) 1301 comprises a natural language parser 1302 and a postprocessor 1303. The natural language parser 1302 identifies, for each sentence it receives as input, the part of speech for each term in the sentence and syntactic relationships between the terms in each clause of the sentence. An SQE may be implemented by integrating a proprietary natural language parser into the ENLP, or by integrating an existing off-the-shelf natural language parser. The postprocessor 1303 examines the natural language parser 1302 output and, from the identified parts of speech and syntactic relationships, determines the grammatical role played by each term in the sentence and the grammatical relationships between those terms. When entity tags or other types of semantic tags (indicating nodes in an ontology path) are used in addition to or in lieu of the grammatical relationships, the postprocessor 1303 (or the natural language parser 1302 if capable of recognizing such tags) identifies, for each sentence (or clause where relevant), each semantic tag type and its value. For example, the term “China” could be recognized as an entity type of “COUNTRY” having the (fully specified) ontology path indicator of “IF/ENTITY/LOCATION/COUNTRY.” The postprocessor 1303 then generates an enhanced data representation from the determined tags, including the entity tags, other ontology node tags, grammatical roles, and syntactic and grammatical relationships.

FIG. 14 is a block diagram of the processing performed by an example Enhanced Natural Language Parser. During document ingestion, the natural language parser 1401 receives a sentence 1403 (or portion thereof) as input, and generates a syntactic structure, such as parse tree 1404. The generated parse tree 1404 identifies the part of speech for each term in each clause of the sentence and describes the relative positions of the terms within the clause. In embodiments that support the recognition of entity tags or other types of ontology path information, the parser 1401 (or postprocessor 1402 if the parser is not capable) also identifies in the parse tree (not shown) the semantic tag type for each corresponding term in the sentence. The postprocessor 1402 receives the generated parse tree 1404 as input, determines the grammatical role of each term in the clause and relationships between terms in the clause, and generates a normalized version of the sentence data annotated with the grammatical role tags (syntactic tags) and semantic tags 1405.

FIG. 15 is a block diagram illustrating a graphical representation of an example syntactic structure generated by the natural language parser component of an Enhanced Natural Language Parser. The parse tree shown is one example of a representation that may be generated by a natural language parser. The techniques of the methods and systems of the present invention, implemented in this example in the postprocessor component of the ENLP, enhance the representation generated by the natural language parser by determining the grammatical role of each meaningful term, associating these terms with their determined roles and determining relationships between terms. In embodiments in which the natural language parser cannot support the recognition of semantic tags, one skilled in the art will recognize that the postprocessor component (such as Postprocessor 1303 in FIG. 13) can be programmed to enhance the representation with such tags. In FIG. 15, the top node 1501 represents the entire sentence, “The president of France visited the capital of China in 1948.” Nodes 1502 and 1503 identify the noun phrase of the sentence, “The president of France,” and the verb phrase of the sentence, “visited the capital of China in 1948,” respectively. The branches of nodes or leaves in the parse tree represent the parts of the sentence further divided until, at each leaf level, each term is singled out and associated with a part of speech. A configurable list of words is ignored by the parser as “stopwords.” The stopword list comprises words that are deemed not indicative of the information being sought. Example stopwords are “a,” “the,” “and,” “or,” and “but.” In one embodiment, question words (e.g., “who,” “what,” “where,” “when,” “why,” “how,” and “does”) are also ignored by the parser. In this example, after ignoring the determiner “The” (node 1504), nodes 1508 and 1509 identify the noun phrase 1505 as comprising a noun, “president” and a prepositional phrase, “of France.” Nodes 1512 and 1513 divide the prepositional phrase 1509 into a preposition, “of,” and a noun, “France.” Nodes 1506 and 1507 divide the verb phrase 1503 into a verb, “visit,” (morphological form of “visited”) and a noun phrase, “the capital of China in 1948.” Nodes 1510 and 1511 divide the noun phrase 1507 ultimately after several additional steps into a determiner “The” (node 1514), which may be ignored as a stopword; a noun “capital” (node 1515); a preposition “of” (node 1518); a noun “China” (node 1519); a preposition “in” (node 1520); and a noun “1948” (node 1521).
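
A parse tree of this general kind, together with stopword filtering at the leaves, can be sketched as shown below. The tree is a simplified illustration and does not reproduce the node structure or numbering of FIG. 15; the tuple layout, the phrase labels, and the function name `leaves` are assumptions made for this example.

```python
STOPWORDS = {"a", "the", "and", "or", "but"}  # example stopwords from the text

# Phrase nodes are (label, child, child, ...); leaves are (part_of_speech, word).
TREE = (
    "S",
    ("NP", ("DET", "The"), ("N", "president"),
     ("PP", ("P", "of"), ("N", "France"))),
    ("VP", ("V", "visited"),
     ("NP", ("DET", "the"), ("N", "capital"),
      ("PP", ("P", "of"), ("N", "China")),
      ("PP", ("P", "in"), ("N", "1948")))),
)

def leaves(node):
    """Yield (part_of_speech, word) for every leaf that is not a stopword."""
    label, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):
        word = rest[0]
        if word.lower() not in STOPWORDS:
            yield (label, word)
        return
    for child in rest:
        yield from leaves(child)

terms = list(leaves(TREE))
print(terms)
```

A postprocessor could walk the same structure to assign grammatical roles, for example treating the noun of the first noun phrase as the subject and the noun of the verb phrase's noun phrase as the object.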

FIG. 16 is a table that conceptually illustrates normalized data that has been annotated with syntactic and semantic tags by the postprocessor component of an Enhanced Natural Language Parser. Depending upon the implementation of the ENLP, the normalized data may or may not be stored in an intermediate data structure prior to being indexed and stored in the enhanced document indexes, such as the term-clause index. The example normalized data representation illustrates annotations applied to the sentence that was illustrated in the parse tree of FIG. 15. The annotations are of course dependent upon the ontology root node specified (which in this case is a default ontology root node called “IF”) and whether the SQE has been configured to parse with semantic tags. Also, one skilled in the art will recognize that the selected roles and relationship information to be stored may be programmatically determined. In the example shown, row 1601 shows the indexing information for the term “president” and specifies that the term is associated with a grammatical role of “subject” and has been tagged as a type of person (relative to the ontology being used). The SQE also recognizes and maintains information that the subject of this clause is associated with a (suffix) modifier term “France,” which has been tagged as a type of country. The SQE maintains modifier information for subjects, objects, and prepositional phrases, because, in some configurations, the SQE can search for specified subject, object, and/or prepositional constraint terms additionally as modifiers, thereby returning documents that potentially may be relevant even though the sentence clauses did not include the specified terms precisely as subjects, objects, or complements of a preposition. Row 1602 shows the indexing information for the term “visited” and specifies that the term is associated with the grammatical role of “verb.” Note that the SQE stores the stemmed form of the verb “visit” so as to potentially match more forms of the verb. Other heuristics could be similarly incorporated. Row 1603 shows the indexing information for the term “capital,” including that the term is tagged with a grammatical role of “object” and is associated with two suffix modifiers “China” and “1948,” the first of which is tagged as a country (and a location and an entity) and the second of which is tagged as a date (and a numeric value and an entity). Note that these terms are maintained by the SQE as modifiers even though they are also maintained as prepositional complements for use in relationship queries that filter based upon prepositional constraints. Row 1604 shows the indexing information for the term “China,” including that the term is tagged with a grammatical role of “prepositional complement” and a semantic tag that specifies that the term is a kind of country (and location and entity). Similarly, row 1605 shows the indexing information for the term “1948,” including that the term is tagged with a grammatical role of “prepositional complement” and a semantic tag that specifies that the term is a kind of date. Row 1606 shows the additional sentence/clause information, which in this case is an indication that the clause is a “temporal” one. Clause and sentence information may indicate, for example, that the clause relative to other clauses in the sentence is a conditional clause, a causal clause, a prepositional clause, or a temporal clause or that the sentence is a question, a definition, or contains temporal or numerical information. One skilled in the art will recognize that other classifications of interclause relationships and of sentences may also be incorporated. Also, other linguistic heuristics can be used to generate enhanced indexing information indicated by the normalized data produced by the ENLP. For example, in some implementations, the ENLP provides “co-referencing” analysis, which allows the ENLP to replace pronouns, nouns, pronoun phrases, noun phrases, aliases, abbreviations, acronyms, etc. with a corresponding identifying noun. This capability allows greater search accuracy, especially when searching for specific entity names.

Note that the normalized data shown in FIG. 16 supports many different types of relationship queries. For example, all of the following relationship queries will cause the SQE to return an indicator to the sentence that has been normalized to the data of FIG. 16 (assuming modifiers are searched):

-   >visits>[country] (Query for information on all visits of all countries.)
-   president < > * (Query for anything a president does.)
-   *>*>China (Query for any relationship with China.) (Note that the SQE returns the sentence because it searches for “China” as a modifier instead of just as an object of the sentence.)
-   *>*>[country] (Query for any relationship with any country.)
-   France < > *< > China (Query for any relationship between France and China.) (Note that the SQE returns the sentence because it searches for “France” and “China” as modifiers instead of just as subjects and/or objects of the sentence.)

Thus, the normalized data demonstrated by FIG. 16 is supportive of and responsive to a very flexible style of specifying relationship queries.
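
How normalized data of this kind can answer such queries, including matches that succeed only because modifiers are searched, can be sketched as follows. This is a minimal sketch, not the indexed representation used by the SQE: the dictionary layout `CLAUSE`, the function names, and the restriction to directed, already-stemmed queries (source, verb, target) are all assumptions made for illustration.

```python
# Hypothetical in-memory form of the FIG. 16 annotations for one clause.
# Each slot holds the stemmed term, its semantic tags, and its modifiers.
CLAUSE = {
    "subject": {"term": "president", "types": {"person"},
                "modifiers": [("france", {"country"})]},
    "verb": {"term": "visit", "types": set(), "modifiers": []},
    "object": {"term": "capital", "types": set(),
               "modifiers": [("china", {"country", "location", "entity"}),
                             ("1948", {"date"})]},
}

def slot_matches(slot, pattern, search_modifiers):
    """Match one query position against a term and, optionally, its modifiers."""
    if pattern == "*":
        return True
    candidates = [(slot["term"], slot["types"])]
    if search_modifiers:
        candidates += slot["modifiers"]
    if pattern.startswith("[") and pattern.endswith("]"):
        wanted = pattern[1:-1]  # bracketed pattern: match on semantic tag
        return any(wanted in types for _term, types in candidates)
    return any(pattern == term for term, _types in candidates)

def clause_matches(clause, source, verb, target, search_modifiers=True):
    """Return True if the clause satisfies a directed source>verb>target query."""
    return (slot_matches(clause["subject"], source, search_modifiers)
            and slot_matches(clause["verb"], verb, False)
            and slot_matches(clause["object"], target, search_modifiers))

print(clause_matches(CLAUSE, "*", "visit", "[country]"))
print(clause_matches(CLAUSE, "*", "*", "china"))
print(clause_matches(CLAUSE, "*", "*", "china", search_modifiers=False))
```

In this sketch the query for “china” as a target succeeds only when modifiers are searched, because “china” is recorded as a modifier of the object “capital” rather than as the object itself.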

The Syntactic Query Engine performs two functions to accomplish effective relationship query processing with syntactic searching capabilities. The first is the parsing, indexing, and storage of a data set (sometimes termed corpus ingestion). The second is the query processing, which according to the example embodiment described herein, results in the execution of keyword searches. These two functions are outlined with reference to FIGS. 17-19.

FIG. 17 is an example block diagram of data set processing performed by a Syntactic Query Engine. As an example, documents that make up a data set 1701 are submitted to the Data Set Preprocessor 1702 (e.g., component 1203 in FIG. 12). If the data set comprises multiple files, as shown in FIG. 17, in one embodiment the Data Set Preprocessor 1702 creates one tagged file containing the document set. The Data Set Preprocessor 1702 then dissects that file into individual sentences and sends each sentence to the ENLP 1704 (e.g., component 1204 in FIG. 12). After the ENLP 1704 parses each received sentence, it sends the generated normalized data that corresponds to each clause of each sentence (e.g., data such as that represented by FIG. 16) to the Data Set Indexer 1705 (e.g., component 1207 in FIG. 12). The Data Set Indexer 1705 processes the ENLP output, indexing and storing the information in a format that is dependent upon the storage representation of the enhanced document indexes (for example, the term-clause, term-sentence, and term-document indexes). One skilled in the art will recognize that other methods of data set preprocessing, indexing, and storing may be implemented in place of the methods described herein, and that such modifications are contemplated by the methods and systems of the present invention. For example, the data may be indexed according to a variety of schemes and distributed across a plurality of repositories.
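
The ingestion flow described above (preprocess, parse, index) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, sentence splitting on periods is a deliberate simplification, each sentence is treated as a single clause, and `toy_parse` merely stands in for the ENLP by emitting un-annotated “term:word” keywords rather than real syntactic and semantic tags.

```python
from collections import defaultdict

def toy_parse(sentence):
    """Stand-in for the ENLP: emit one keyword per word, with no real tagging."""
    return [f"term:{word.lower()}" for word in sentence.split()]

def build_term_clause_index(documents, parse_clause):
    """Map each annotated keyword to the (document ID, sentence ID) pairs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Preprocessing: assign IDs and dissect the document into sentences.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        for sent_id, sentence in enumerate(sentences):
            # Parsing and indexing: record where each keyword occurs.
            for keyword in parse_clause(sentence):
                index[keyword].add((doc_id, sent_id))
    return index

documents = [
    "China buys wheat. France sells wine.",
    "China sells rice.",
]
index = build_term_clause_index(documents, toy_parse)
print(sorted(index["term:china"]))
```

Because new postings are simply added to the existing sets, the same routine could in principle be applied to newly arriving documents, provided document IDs are assigned so that they do not collide with IDs already in the index.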

In addition to indexing and storing a data set initially, in some embodiments, the SQE can incrementally index and store new documents, updating the relevant enhanced document indexes as necessary. In addition, in embodiments that support dynamic changes to an existing ontology, the SQE can determine a set of affected documents and “re-ingest” a portion of the corpus as needed. Other variations can be similarly accommodated.

After indexing and storing a data set, the SQE may perform its second function, processing relationship queries against the stored data set. FIG. 18 is a block diagram of query processing performed by a Syntactic Query Engine. A user 1801 (or a program through an API) submits a relationship query 1810 to the SQE. The Query Processor 1802 component of the SQE transforms the query into one or more keyword searches 1811, with appropriate syntactic and semantic annotation information included, and executes the keyword searches 1811 by invoking one or more keyword search engine processes, for example, keyword search engines 1804-1807. The results of each keyword search 1811 are subsequently returned to the invoking Query Processor 1802, which then combines the results 1812 as specified in the relationship query 1810 and returns them to the user/program.

FIG. 19 is an example flow diagram of relationship query processing steps performed by an example query processor of a Syntactic Query Engine. The query processor executes one or more of example steps 1901-1907 for each query that is forwarded from the user interface/API support modules. One skilled in the art will recognize that the precise behaviors of each step depend upon the heuristics and other rules that are encoded, the preferences set for search parameters, and the way the normalized data is actually stored in the term-clause, term-sentence, and term-document indexes. In step 1901, the query processor receives a relationship query. Recall that the relationship query of the example syntax described above specifies a syntactic search portion (which may be empty), prepositional constraints, document level keyword filters, and meta-data filters. Also, it is possible to specify values for any one of the relationship query components without the others. Depending upon the implementation, the query processor may include a relationship query interpreter or parser (not shown) to parse the received query into its constituent parts and to produce some form of code (internally specified, using a standard programming language, or otherwise) that controls the flow of the keyword searches that are invoked and the combining of the results. This approach is especially useful with a syntax as described that follows a prescribed grammar. The relationship query is then transformed as necessary in example steps 1902-1907 in accordance with the implementation.

In step 1902, the query is transformed to handle synonyms of any specified subjects and/or objects. In one embodiment, synonyms are handled by searching the ontology structure for synonyms of a specified term, and, if they are present, adding keyword searches for each synonym found. In an alternative embodiment, terms having synonyms are mapped (e.g., at SQE configuration time) to a common indicator, such as a “concept identifier” (concept ID). During ingestion, terms are looked up in the map to determine whether they have corresponding synonyms (hence concept IDs), and, if so, the concept IDs are stored as part of the indexing information. Upon receiving a query, a look up is performed to find a corresponding concept ID (if one exists) to a received term. The query is then transformed so that the resultant keyword searches contain the corresponding concept ID as appropriate. One skilled in the art will recognize that, using either mechanism (or any other implementation), the formatting of the invoked keyword searches needs to correspond to the way the data has been indexed.
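The concept ID alternative can be pictured with a minimal sketch. The map contents, the `c_0001` identifier format, and the function names below are illustrative assumptions, not part of the described system; the point is only that the same lookup is applied at ingestion and at query time so that the keyword search matches the way the data was indexed.

```python
# Hypothetical synonym map built at configuration time: term -> concept ID.
CONCEPT_IDS = {"car": "c_0001", "automobile": "c_0001", "auto": "c_0001"}

def index_tokens(terms):
    """At ingestion, store each term plus its concept ID (if it has one)."""
    tokens = []
    for term in terms:
        key = term.lower()
        tokens.append(key)
        if key in CONCEPT_IDS:
            tokens.append(CONCEPT_IDS[key])
    return tokens

def rewrite_query_term(term):
    """At query time, replace a term by its concept ID when one exists."""
    key = term.lower()
    return CONCEPT_IDS.get(key, key)

indexed = index_tokens(["Automobile", "dealer"])
query_token = rewrite_query_term("car")
matches = query_token in indexed  # "car" finds a clause that says "automobile"
```

Because both sides pass through the same map, a query for one synonym matches clauses that were indexed under another.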

In step 1903, the query processor transforms the query to handle ontology path specifications or “types” if provided in the received query string. For example, a relationship query may provide a subject and/or object list as [entity] or [person] or [location/country], etc., which is interpreted as a type of node in an ontology hierarchy. The amount of the pathname that is specified is matched to the ontology. Thus, the entity specification “[location/country]” is matched to any ontology path containing that sub-path. Keyword searches are thus specified for each of the matching ontology paths. Similarly, heuristics may be applied that add further keyword searches for related terms, such as hypernyms and hyponyms (more generic and more specific classification terms, respectively), if not already accounted for using available synonym logic.
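Sub-path matching of this kind can be sketched as follows. The ontology paths and the function name are invented for illustration, and matching on whole path segments (rather than raw substrings) is an assumption about how "containing that sub-path" is meant.

```python
def matching_paths(spec, ontology_paths):
    """Return the ontology paths that contain the bracketed sub-path `spec`."""
    wanted = spec.strip("[]").split("/")
    hits = []
    for path in ontology_paths:
        parts = path.split("/")
        # Slide a window over the path so matches respect segment boundaries.
        for i in range(len(parts) - len(wanted) + 1):
            if parts[i:i + len(wanted)] == wanted:
                hits.append(path)
                break
    return hits

# Hypothetical ontology paths.
ONTOLOGY = [
    "entity/location/country",
    "entity/location/city",
    "entity/person/female",
    "entity/organization/location/country_office",
]
country_paths = matching_paths("[location/country]", ONTOLOGY)
```

One keyword search would then be issued for each path in `country_paths`.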

In step 1904, the query processor transforms the query to handle action types (types of verbs) if specified in the relationship query. For example, a query that specifies “president < > [communication]” instructs the SQE to find all relationships that involve a president doing something by any verb that is considered to be a communication verb. Like the implementations for synonyms described above, the query processor can handle this by including additional keyword searches for each verb of that action type, or can use some kind of verb concept identifier. Again, the query processor needs to match whatever form in which the indexed data is stored.

In step 1905, based upon the additional transformations from steps 1902-1904, the query processor reformulates the relationship query into one or more keyword searches that can be executed by a keyword search engine. In step 1906, the one or more keyword searches are accordingly invoked and executed. If the enhanced document index is stored as one data structure, then it is possible to execute one keyword search. Alternatively, if the indexed data is actually split between several matrices, then a keyword search is executed on each index as appropriate. For example, searches for matching “keywords” as subjects (or modifiers of subjects) are executed on the subject term-clause index. In step 1907, the results of the keyword searches are combined as expressed in the flow of control logic parsed from the relationship query, and then forwarded to an interface for presentation to the user or program that invoked the relationship query. The query processor then returns to the beginning of the loop in step 1901.
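Steps 1904-1907 can be illustrated with a toy sketch for the query “president < > [communication]”. The in-memory dictionaries, clause ids, and verb class below are assumptions made for the example; a real deployment would invoke a keyword search engine against the role-specific term-clause indexes.

```python
# Toy per-role term-clause indexes: token -> set of clause ids (illustrative).
SUBJECT_INDEX = {"president": {"d1_s1_c1", "d2_s4_c1"}, "senator": {"d3_s2_c1"}}
VERB_INDEX = {"say": {"d1_s1_c1"}, "announce": {"d2_s4_c1", "d3_s2_c1"}}

# Hypothetical action type: the verbs considered to be communication verbs.
ACTION_TYPES = {"communication": ["say", "announce"]}

def keyword_search(index, tokens):
    """One keyword search: OR together the postings of every token."""
    hits = set()
    for token in tokens:
        hits |= index.get(token, set())
    return hits

def relationship_search(subject, action_type):
    """Run one keyword search per role index, then AND the results."""
    subject_hits = keyword_search(SUBJECT_INDEX, [subject])
    verb_hits = keyword_search(VERB_INDEX, ACTION_TYPES[action_type])
    return subject_hits & verb_hits

clauses = relationship_search("president", "communication")
```

The action type is expanded into one search term per verb, and the subject and verb results are intersected because both conditions must hold in the same clause.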

The functions of data set processing (data object ingestion) and relationship query processing can be practiced in any number of centralized and/or distributed configurations of client-server systems. Parallel processing techniques can be applied in performing indexing and query processing to substantially increase throughput and responsiveness. Representative configurations and architectures are described below with respect to FIGS. 20-25; however, one skilled in the art will recognize that a variety of other configurations could equivalently perform the functions and capabilities identified herein.

FIG. 20 is an example block diagram of a general purpose computer system for practicing embodiments of a Syntactic Query Engine. The computer system 2001 contains one or more central processing units (CPUs) 2002, Input/Output devices 2003, a display device 2004, and a computer memory (memory) 2005. The Syntactic Query Engine 2020, including the Query Processor 2006, Keyword Search Engine 2007, Data Set Preprocessor 2008, Data Set Indexer 2011, Enhanced Natural Language Parser 2012, and data set repository 2015, preferably resides in memory 2005, with the operating system 2009 and other programs 2010, and executes on the one or more CPUs 2002. One skilled in the art will recognize that the SQE may be implemented using various configurations. For example, the data set repository may be implemented as one or more data repositories stored on one or more local or remote data storage devices. Furthermore, the various components comprising the SQE may be distributed across one or more computer systems including handheld devices, for example, cell phones or PDAs. Additionally, the components of the SQE may be combined differently in one or more different modules. The SQE may also be implemented across a network, for example, the Internet, or may be embedded in another device.

FIG. 21 is an example block diagram of a distributed architecture for practicing embodiments of a Syntactic Query Engine. This architecture supports parallel processing of the indexing (ingestion) of each document as well as parallel query processing. The basic organization involves storing a portion of each (term-clause, sentence, and document) index on multiple machines (e.g., servers), with potentially multiple CPUs, in order to achieve greater throughput and accommodate the extensive storage requirements of a very large corpus. For example, a large corpus will typically exceed the CPU and storage limits of a single server machine. Moreover, to provide commercially viable search solutions, the SQE needs to respond to queries in a timely fashion. Thus, the number of servers and CPUs is typically determined by the expected size of the data set and the desired query response time, and is typically set up during SQE configuration.

The unit of organization used to support indexing and searching is termed a “partition.” Thus, an enhanced document index (labeled here as a “keyword index”) typically comprises a plurality of “partition indexes,” each of which stores some portion of the total keyword index. To perform a search on the entire corpus, then, it is necessary to search each of the partition indexes (with the same keyword search string) and thereafter to combine the results as if the search were performed on a single index. Note that the keyword index may be partitioned according to a variety of schemes, including, for example, a percentage of the index based upon the size of the documents indexed, documents that are somehow related by concept or other classification, schemes based upon storing portions of the index based upon a type supported by the ontology, etc. Any such scheme may be implemented by the servers and may be optimized for the application for which the SQE is being deployed.
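A minimal sketch of searching every partition index with the same keyword and merging the hits is shown below. The dictionary layout (clause id mapped to a set of tokens) and the identifiers are assumptions chosen for brevity, not the storage format of the described system.

```python
def search_partition(partition, keyword):
    """Return the clause ids in one partition index that contain `keyword`."""
    return sorted(cid for cid, tokens in partition.items() if keyword in tokens)

def search_corpus(partitions, keyword):
    """Send the same keyword search to every partition and merge the hits,
    so the caller sees the result of a search over a single index."""
    merged = []
    for partition in partitions:
        merged.extend(search_partition(partition, keyword))
    return sorted(merged)

# Two hypothetical partition indexes: clause id -> tokens in that clause.
PARTITION_A = {"d1_s1_c1": {"jane", "admire", "seattle"}, "d1_s2_c1": {"june"}}
PARTITION_B = {"d7_s3_c1": {"seattle", "rain"}}
hits = search_corpus([PARTITION_A, PARTITION_B], "seattle")
```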

A variety of servers and services are employed to process the ingestion and searching on the backend so as to present a unified view of the term-clause, sentence, and document indexes. FIG. 21 presents one such embodiment, although one skilled in the art will recognize that a variety of other organizations and components can equivalently support and provide the functions and techniques of the SQE. In FIG. 21, an index manager 2101 schedules document ingestion for a collection of documents 2110 among a plurality of workers 2102a-2102d, each responsible for indexing a portion of the corpus. The work could be divided at a variety of levels including by document, by sentence, etc., and allows the ingestion workload to be processed in parallel, thus decreasing the amount of time required to ingest a corpus. Each worker 2102a-2102d contains an instance of the SQE data set processing components (and others if necessary), including the preprocessor and an instance of the ENLP. Upon parsing a sentence and annotating it with syntactic and semantic tags, the worker 2102a-2102d creates a corresponding temporary keyword index 2103a-2103d, which represents the portion of the corpus that it has processed until stored in the partition indexes 2104-2105. The index manager 2101 is responsible for distributing the temporary keyword indexes 2103a-2103d to the partition indexes 2104 and 2105 to be merged into their respective keyword indexes 2106 and 2107. Note that the index manager 2101 and the workers 2102a-2102d may in some embodiments utilize an additional database management system 2120 to store recovery information, such as copies of documents, document metadata, sentences, parse trees, and a copy of the clause tables 2130, although this is a convenience and not necessitated by the functions of the SQE.

FIG. 22 is a block diagram overview of a parallel processing architecture that supports indexing a corpus of documents. This figure shows one arrangement of servers that can be used to effect the parallel processing architecture of FIG. 21. Specifically, AdminClient 2201 controls invocation of an IndexManager (server) 2202, which stores working and recovery information in a database 2203 (if part of a particular implementation) and distributes indexing work to one or more IndexWorkers (servers) 2204. When an IndexWorker 2204 completes indexing of an object (document, sentence, etc.), notification is returned to the IndexManager 2202, which at appropriate times instructs a corresponding PartitionIndex (server) 2205 to store the indexing information in the appropriate clause, sentence, and document indexes. Each IndexWorker 2204 may also communicate with a WebServer 2206 to deliver status and error information.

FIG. 23 is a block diagram overview of a parallel processing architecture that supports relationship queries. The partition indexes, such as Partition Index A 2104 and Partition Index B 2105 (in FIG. 21), may be arranged in a hierarchy of searchers (servers), and more than one partition index may be managed by a single searcher. Typically, it is advised to have a separate partition index for each CPU present in a server machine to take advantage of inherent parallel processing opportunities in a multiple CPU/parallel processor machine; however, other arrangements are also possible. In FIG. 23, a user such as a researcher using a web browser user interface 2301 or an application using the SQE API 2302 issues a relationship query to the SQE as described in detail in the other figures via some supported communications protocol, such as HTTP. (Note also that a server side application that resides on the search service server 2311 could also issue a direct request to the search service 2304.) WebServer 2303 receives the relationship query and issues appropriate search requests to the SearchService 2304. Note that depending upon the particular implementation, the various functional components described by FIG. 12 and multiple instances of the same components could reside upon one or more of these servers. The query is preferably organized into a plurality of keyword and ontology searches that are distributed to be processed in parallel and then combined before returning a result to the WebServer 2303. (The returned result flow is not shown.) Thus, search service 2304 invokes a “top” level searcher 2305, which is responsible for conducting the parallel searches to effectuate a search of the entire keyword index. Searcher 2305 is shown communicating via a remote method invocation protocol to a single partition index server 2308. Searcher 2305 instructs (sub)searcher 2307 to also perform part of the search. Searcher 2307 is shown communicating with two partition indexes, 2309 and 2310. The searcher 2305 also communicates with a (possibly hierarchy of) ontology searchers 2306 as needed to search for pathnames in the ontology (and for browsing the ontology as supported by other aspects of an example SQE user interface).

FIG. 24 is an example block diagram that shows parallel searching of an enhanced document index. In FIG. 24, a search service 2401 receives a search and distributes the requested relationship search to a top level searcher 2402. The top level searcher 2402 then, in parallel, invokes the same relationship search on a plurality of searchers 2403-2405, depending upon the organization of the partition indexes and whether it is required to search all of them for a particular relationship query. For example, if the partition indexes are organized such that a percentage of the corpus is indexed on each (not by entity type or some other organization), then all of the partition indexes are searched in parallel. Searcher 2403 performs the relationship search on partition index 2410, searcher 2404 performs the relationship search on partition indexes 2422 and 2423, and so on through searcher 2405, which performs the relationship search on partition index 2424. Also, if an ontology search (for synonyms, pathnames, etc.) is required, then the top searcher 2402 invokes a top level ontology searcher 2406 to perform (in parallel as required) an ontology search using one or more ontology searchers such as searcher 2407 to search one or more ontology data repositories 2408 and 2409.
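The fan-out from a top level searcher to several searchers, each owning one or more partition indexes, can be sketched with a thread pool. The `Searcher` class, the token-to-clause-id dictionaries, and the use of threads in place of remote servers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

class Searcher:
    """A searcher that owns one or more partition indexes (token -> clause ids)."""

    def __init__(self, partitions):
        self.partitions = partitions

    def search(self, keyword):
        hits = set()
        for partition in self.partitions:
            hits |= partition.get(keyword, set())
        return hits

def top_level_search(searchers, keyword):
    """Invoke the same search on every searcher in parallel and union the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        results = pool.map(lambda s: s.search(keyword), searchers)
        return set().union(*results)

# One searcher with a single partition, one searcher with two partitions.
searchers = [
    Searcher([{"arafat": {"c1"}}]),
    Searcher([{"arafat": {"c2"}}, {"arafat": {"c3"}, "nidal": {"c4"}}]),
]
all_hits = top_level_search(searchers, "arafat")
```

The caller sees one combined result set, regardless of how many searchers or partitions took part.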

As mentioned, it is sometimes desirable to support the indexing of additional corpus information even when the corpus is being searched. This provides the ability to support incremental indexing of data. It is also sometimes desirable to provide fault tolerance, especially in mission critical applications. FIG. 25 is an example block diagram of an architecture of the partition indexes that supports incremental updates and data redundancy. The underlying organization involves maintaining several data instances of the partition index, only one of which is “active” for searching at any one time, and maintaining a redundant copy of the data instances that comprise the partition index. The “active” partition index data instance provides the view of the data that the initiator of a query believes is current. To update a partition index, the searcher redirects the indicator of the active partition index data instance to a different data instance. In FIG. 25, the searcher 2501 maintains a master partition index 2502 and a clone partition index 2503, which is a replica of the master partition index. Each of the partition indexes 2502 and 2503 in turn maintains a plurality of data instances, for example data instances 2510-2512 and 2520-2522. In the diagram, partition index data instance 2511 is indicated as the “active” partition index data instance. While instance 2511 is active, the searcher 2501 can update other data instances 2510 and 2512, thus providing another type of parallelism. Since clone partition index 2503 is a replica of the master partition index 2502, the data instances 2520-2522 are replicas of the information and state of data instances 2510-2512. One skilled in the art will recognize that there are other ways to provide incremental updating and that FIG. 25 illustrates one of them.
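The active-instance idea can be illustrated with a small sketch. The class and method names are hypothetical, the data instances are plain dictionaries, and the clone partition index is omitted; the sketch shows only that searches keep reading the active instance while another instance is updated, and that an update becomes visible when the active indicator is redirected.

```python
class PartitionIndex:
    """Holds several data instances; exactly one is active for searching."""

    def __init__(self, instances):
        self.instances = instances  # list of dicts: token -> set of clause ids
        self.active = 0             # position of the active data instance

    def search(self, keyword):
        return set(self.instances[self.active].get(keyword, set()))

    def update(self, position, token, clause_id):
        """Update an inactive instance while the active one keeps serving."""
        if position == self.active:
            raise ValueError("cannot update the active data instance")
        self.instances[position].setdefault(token, set()).add(clause_id)

    def activate(self, position):
        """Redirect the active indicator to a different data instance."""
        self.active = position

master = PartitionIndex([{"seattle": {"c1"}}, {"seattle": {"c1"}}])
before = master.search("seattle")
master.update(1, "seattle", "c2")   # incremental indexing on the side
during = master.search("seattle")   # searchers still see the old view
master.activate(1)
after = master.search("seattle")    # the new clause is now visible
```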

The architectures described (and others) can be used to support the indexing and searching functions of an example SQE. FIG. 26 is an example conceptual diagram of the transformation of a relationship search into component portions that are executed using a parallel architecture. In the example illustrated, the relationship query 2601 is a link search; however, one skilled in the art will recognize that the technique described can be applied and extended to a variety of searches, including a plurality of relationship searches that are combined by a scripting language or other means of controlling flow. The query being processed:

-   -   Arafat < > {[organization]} < > Abu Nidal

instructs the SQE to find all relationships where there is a third entity that is an organization linking Arafat and Abu Nidal. In this case, the SQE transforms the query into two syntactic sub-searches 2602 and 2603:

-   -   Arafat < > * < > [organization]

which will locate all organizations with which Arafat has any kind of relationship; and

-   -   Abu Nidal < > * < > [organization]

which will locate all organizations with which Abu Nidal has any kind of relationship. Each of these syntactic searches 2602 and 2603 is executed using, for example, the parallel architecture described with reference to FIGS. 22-25. The syntactic search 2602 is distributed to a top searcher 2604 to perform one or more syntactic searches on the partition indexes that make up the corpus and one or more ontology searches as required. Note that as part of this process, the various searchers invoke one or more keyword search engines to perform the actual keyword search on the annotated indexes. Similarly, the syntactic search 2603 is distributed to a top searcher 2605 to perform one or more syntactic searches on the partition indexes that make up the corpus and one or more ontology searches as required. Again, keyword search engines are invoked as part of this process. Once results from the sub-searches are determined, the query processor, for example, one residing in a search service (such as search service 2401 in FIG. 24), determines based upon the initial query 2601 how to combine the results. In the example described, the intersection of the resulting clauses provides the overall query result 2607 desired. One skilled in the art will recognize that similar combinations of sub-searches can be accommodated. Combinations that call for an intersection (as from a Boolean AND operation) are easily specified. However, to support other types of control flow operations, such as those that require a union of the resultant data, it must be defined which aspects of the results are to be combined, especially if the sub-searches yield different types of results.
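The combining step of a link search can be sketched as follows. The hit format (clause id paired with the organization found in that clause), the placeholder organization names, and the choice to join the two result sets on the shared organization are assumptions made for illustration; the text above states only that the results are intersected.

```python
def organizations(results):
    """Collect the organization named in each (clause_id, organization) hit."""
    return {org for _clause_id, org in results}

def link_search(results_a, results_b):
    """Keep the clauses from both sub-searches whose organization occurs in both."""
    shared = organizations(results_a) & organizations(results_b)
    return {
        org: sorted(cid for cid, o in results_a + results_b if o == org)
        for org in sorted(shared)
    }

# Invented results of the two sub-searches (placeholder organization names).
first_hits = [("d1_s1_c1", "org_x"), ("d2_s3_c1", "org_y")]
second_hits = [("d5_s2_c1", "org_y"), ("d6_s1_c2", "org_z")]
links = link_search(first_hits, second_hits)
```

Here `links` names each linking organization together with the clauses that support the link from either side.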

The architectures illustrated (and others) can also support the preprocessing and data storage functions of an example SQE. As described with reference to FIG. 17, the Data Set Preprocessor 1702 performs two overall functions: building one or more tagged files from the received data set files and dissecting the data set into individual objects, for example, sentences. These functions are described in detail below with respect to FIGS. 27-29. Although FIGS. 27-29 present a particular ordering of steps and are oriented to a data set of objects comprising documents, one skilled in the art will recognize that these flow diagrams, as well as all others described herein, are examples of one embodiment. Other sequences, orderings and groupings of steps, and other steps that achieve similar functions, are equivalent to and contemplated by the methods and systems of the present invention. These include steps and ordering modifications oriented toward non-textual objects in a data set, such as audio or video objects.

FIG. 27 is an example flow diagram of the steps performed by a build_file routine within the Data Set Preprocessor component of a Syntactic Query Engine. The build_file routine generates text for any non-textual entities within the data set, identifies document structures (e.g., chapters or sections in a book), and generates one or more tagged files for the data set. In one embodiment, the build_file routine generates one tagged file containing the entire data set. In alternate embodiments, multiple files may be generated, for example, one file for each object (e.g., document) in the data set. In step 2701, the build_file routine creates a text file. In step 2702, the build_file routine determines the structure of the individual elements that make up the data set. This structure can be previously determined, for example, by a system administrator and indicated within the data set using, for example, HTML tags. For example, if the data set is a book, the defined structure may identify each section or chapter of the book. These HTML tags can be used to define document level attributes for each document in the data set. In step 2703, the build_file routine tags the beginning and end of each document (or section, as defined by the structure of the data set). In step 2704, the routine performs OCR processing on any images so that it can create searchable text (lexical units) associated with each image. In step 2705, the build_file routine creates one or more sentences for each chart, map, figure, table, or other non-textual entity. For example, for a map of China, the routine may insert a sentence of the form,

-   -   This is a map of China.

In step 2706, the build_file routine generates an object identifier (e.g., a Document ID) and inserts a tag with the generated identifier. In step 2707, the build_file routine writes the processed document to the created text file. Steps 2702 through 2707 are repeated for each file that is submitted as part of the data set. When there are no more files to process, the build_file routine returns.

FIG. 28 illustrates an example format of a tagged file built by the build_file routine of the Data Set Preprocessor component of a Syntactic Query Engine. The beginning and end of each document in the file are marked, respectively, with a <DOC> tag 2801 and a </DOC> tag 2802. The build_file routine generates a Document ID for each document in the file. The Document ID is marked by and between a <DOCNO> tag 2803 and a </DOCNO> tag 2804. Table section 2805 shows example sentences created by the build_file routine to represent lexical units for a table embedded within the document. The first sentence for Table 2805,

-   -   This table shows the Defense forces, 1996,

is generated from the title of the actual table in the document. The remaining sentences shown in Table 2805 are generated from the rows in the actual table in the document. One skilled in the art will recognize that various processes and techniques may be used to identify documents within the data set and to identify entities (e.g., tables) within each document. The use of equivalent and/or alternative processes and markup techniques and formats, including HTML, XML, SGML, and non-tagged techniques, is contemplated and may be incorporated in methods and systems of the present invention.
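A much-reduced sketch of the tagging performed by a build_file routine follows. The `<DOC>` and `<DOCNO>` tags and the generated caption sentence come from the description above; the function signature, the six-digit identifier format, and the omission of OCR and structure detection are assumptions made to keep the example short.

```python
def build_file(documents):
    """Wrap each document in <DOC>/<DOCNO> tags and return the tagged text.

    `documents` is a list of (text, captions) pairs, where `captions` names
    any non-textual entities (maps, charts, tables) found in the document.
    """
    lines = []
    for number, (text, captions) in enumerate(documents, start=1):
        lines.append("<DOC>")
        lines.append(f"<DOCNO>{number:06d}</DOCNO>")  # hypothetical id format
        lines.append(text)
        for caption in captions:
            # One generated sentence per non-textual entity.
            lines.append(f"This is {caption}.")
        lines.append("</DOC>")
    return "\n".join(lines)

tagged = build_file([("China has many provinces.", ["a map of China"])])
```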

The second function performed by the Data Set Preprocessor component of the SQE is dissecting the data set into individual objects (e.g., sentences) to be processed. FIG. 29 is an example flow diagram of the steps performed by the dissect_file routine of the Data Set Preprocessor component of a Syntactic Query Engine. In step 2901, the routine extracts a sentence from the tagged text file containing the data set. In step 2902, the dissect_file routine preprocesses the extracted sentence, preparing the sentence for parsing. The preprocessing step may comprise any functions necessary to prepare a sentence according to the requirements of the natural language parser component of the ENLP. These functions may include, for example, spell checking, removing excessive white space, removing extraneous punctuation, and/or converting terms to lowercase, uppercase, or proper case. One skilled in the art will recognize that any preprocessing performed to put a sentence into a form that is acceptable to the natural language parser can be used with techniques of the present invention. In step 2903, the routine sends the preprocessed sentence to the ENLP. In step 2904, the routine receives as output from the ENLP a normalized data representation of the sentence. In step 2905, the dissect_file routine forwards the original sentence and the normalized data representation to the Data Set Indexer for further processing. Steps 2901-2905 are repeated for each sentence in the file. When no more sentences remain, the dissect_file routine returns.
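The loop of steps 2901-2905 can be sketched as below. The `parse` and `index` callables stand in for the ENLP and the Data Set Indexer, the sentence splitter is a naive punctuation-based assumption, and preprocessing is limited to white space cleanup; none of these choices is prescribed by the description.

```python
import re

def preprocess(sentence):
    """Collapse runs of white space and trim the ends (step 2902)."""
    return re.sub(r"\s+", " ", sentence).strip()

def dissect_file(text, parse, index):
    """Extract each sentence, parse it, and forward the result for indexing."""
    count = 0
    # Naive splitter: break after '.', '!' or '?' followed by white space.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        normalized = parse(preprocess(sentence))   # steps 2903-2904
        index(sentence, normalized)                # step 2905
        count += 1
    return count

stored = []
n = dissect_file(
    "Jane admires   Seattle.  Is it sunny?",
    parse=lambda s: s.lower().rstrip(".?!").split(),  # stand-in for the ENLP
    index=lambda original, normalized: stored.append((original, normalized)),
)
```

Note that the original sentence, not the preprocessed one, is forwarded alongside the normalized data, matching step 2905.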

The Data Set Indexer (e.g., component 1705 in FIG. 17) prepares the normalized data generated from the data set (e.g., as illustrated in FIG. 16) to be indexed and stored in the data set repository. One skilled in the art will recognize that the normalized data can be stored in a variety of ways and data structures, yet still achieve the abstraction of maintaining a term-clause matrix, a term-sentence matrix, or a term-document matrix. Any data structure that can be understood by the target keyword search engine being used is operable with the techniques of the present invention. In one embodiment, separate indexes exist for each enhanced document (term-clause, term-sentence, and term-document) matrix. In addition, in some embodiments the term-clause index is further divided into a separate index for each grammatical role, so as to allow more efficient keyword searches. The indexes are cross referenced by an internal identifier, which can be used to decipher a document id, sentence id, or a clause id. The tuple (document id, sentence id, clause id) uniquely identifies each clause in the document corpus. Other divisions and distributions of the data can be accommodated. Table 1 below conceptually illustrates the information that is maintained in an example term-clause index of the present invention.

TABLE 1

Field Name            Type                 Description
Id (internal)         indexed, stored      document id, sentence id, clause id
                                           concatenated, separated by ‘_’
subject               tokenized, indexed   Contains subject(s), subject modifiers and
                                           entity type(s) for subjects and modifiers.
                                           The modifiers are preferably separated into
                                           prefix and suffix. If subject has entity
                                           type, the data indexer also stores t_entity
                                           (just once). If any modifier has entity
                                           type, the data indexer also stores tm_entity
                                           (just once). Noun phrases that were
                                           recognized by NL parser are also stored with
                                           spaces replaced by ‘\.’ The subject field
                                           order is:
                                             prefix_subject_mod subject
                                             suffix_subject_mod
                                             Entity_types
                                             NLP_noun_phrases
object                tokenized, indexed   Contains object(s), object modifiers and
                                           entity type(s) for objects and modifiers.
                                           The modifiers are preferably separated into
                                           prefix and suffix. If object has entity
                                           type, the data indexer stores t_entity (just
                                           once). If any modifier has entity type, the
                                           data indexer also stores tm_entity (just
                                           once). Noun phrases that were recognized by
                                           NL parser are also stored with spaces
                                           replaced by ‘\.’ The object field order is:
                                             prefix_object_mod object
                                             suffix_object_mod
                                             Entity_types
                                             NLP_noun_phrases
pcomp                 tokenized, indexed   Contains pcomp(s), preposition(s), pcomp
                                           modifiers and entity type(s) for pcomp,
                                           modifiers. The modifiers are preferably
                                           separated into prefix and suffix. If pcomp
                                           has entity type, the data indexer also
                                           stores t_entity (just once). If any modifier
                                           has entity type, the data indexer also
                                           stores tm_entity (just once). Noun phrases
                                           that were recognized by NL parser are also
                                           stored with spaces replaced by ‘\.’ The
                                           pcomp field order is:
                                             preposition pcomp modifiers,
                                             pcomp Entity_types
                                             NLP_noun_phrases
verb                  tokenized, indexed   Contains verb(s), verb modifiers and entity
                                           type(s) for verbs and modifiers. Noun
                                           phrases that were recognized by NL parser
                                           are also stored with spaces replaced by
                                           ‘\.’ The verb field order is:
                                             prefix_verb_mod verb suffix_verb_mod
                                             Entity_types
                                             NLP_noun_phrases
parent_id             indexed, stored      clause id(10)
clause_rel_sent_class tokenized, indexed   Contains inter-clause relationships such as:
                                             conditional_c
                                             causal_c
                                             prepositional_c
                                             temporal_c
                                           and Sentence Attributes such as:
                                             question_s
                                             definition_s
                                             temporal_s
                                             numerical_s
relationship          stored               (Encoded clause for display)

As can be observed from Table 1, a variety of information is indexed to correspond to the term-clause index. “Entity_types” includes whatever types are supported by the ontology. In a default system, several types of entities are supported; however, one skilled in the art will recognize that other categorizations of types could also be supported. Similarly, particular exemplary sentence and inter-clause relationship types are listed; however, other classifications are supported as well.

FIG. 30 is an example conceptual block diagram of a sentence that has been indexed and stored in a term-clause index of a Syntactic Query Engine. The example sentence illustrated is “Jane admires sunny Seattle on a busy June 3rd.” The id field 3001 is an internal string that can cross-reference to the corresponding clause, sentence, and document. The subject field 3002 includes the term “Jane” (the subject), which has no modifiers, but is a member of two classifications in the ontology: an individual (t_entity/person/any/individual) and a female (t_entity/person/female). The field also stores that the subject has an entity type (indicated as t_entity). The verb field 3003 includes the stemmed form of the verb term “admires” (the verb), followed by a series of suffix modifiers of the verb, which appear also as parts of prepositional phrases in pcomp field 3005. The modifiers (m_on, m_busy, m_June, m_(—)3rd) are stored in the verb field along with the information that at least one of the modifiers has an entity type (indicated by a tm_entity tag) and that the entity type in the modifier list includes a date (tm_entity/temporal/date). As illustrated, the object field 3004 includes the term “Seattle,” along with annotations that it has an entity type (t_entity) of city (t_entity/location/city) and has a series of prefix and suffix modifiers (m_sunny, m_on, m_busy, m_June, m_(—)3rd) that have entity types (tm_entity) including a date (tm_entity/temporal/date). The pcomp (prepositional complement) field 3005 includes the terms in the prepositional phrase “on a busy June 3rd” stored with the phrase “June 3rd” as the prepositional complement and the other terms as modifiers. The phrase is recognized as an entity, hence the pcomp field includes an entity type (t_entity) of date (t_entity/temporal/date).
The parent_id field 3006 indicates the clause id of the parent clause in the sentence if there are multiple clauses. The clause_rel_sent_class field 3007 indicates any inter-clause relationships, such as whether the clause is a conditional phrase, and any sentence attributes such as an annotation that the sentence is, as in this case, a temporal statement. Such classifications enable keyword searching based upon classifications of sentences as well as other syntactic and semantic tags. The relationship field 3008 is used for displaying the clause and is implementation specific.
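
The fields described for FIG. 30 can be pictured with a small sketch. The Python below is only an illustration under stated assumptions, not the SQE implementation: the id layout, the "m_3rd" spelling of the last modifier, the omission of the article "a", the "temporal" class value, and the helper names field_has and find_clauses are all hypothetical. It shows how tagged terms stored as ordinary tokens allow a plain membership test to answer a relationship-style question.

```python
# Hypothetical in-memory model of one term-clause index entry for the
# sentence "Jane admires sunny Seattle on a busy June 3rd."
# Field names follow FIG. 30; the id layout, the "m_3rd" spelling, and the
# "temporal" class value are assumptions made for illustration only.
EXAMPLE_CLAUSE = {
    "id": "doc1_1_1",
    "subject": ["Jane", "t_entity",
                "t_entity/person/any/individual", "t_entity/person/female"],
    "verb": ["admire", "m_on", "m_busy", "m_June", "m_3rd",
             "tm_entity", "tm_entity/temporal/date"],
    "object": ["Seattle", "t_entity", "t_entity/location/city",
               "m_sunny", "m_on", "m_busy", "m_June", "m_3rd",
               "tm_entity", "tm_entity/temporal/date"],
    "pcomp": ["June 3rd", "m_on", "m_busy",
              "t_entity", "t_entity/temporal/date"],
    "parent_id": "",
    "clause_rel_sent_class": ["temporal"],
}


def field_has(entry, field, tokens):
    """Return True if every requested token is stored in the named field."""
    stored = entry.get(field, [])
    return all(token in stored for token in tokens)


def find_clauses(index, constraints):
    """Return ids of clauses whose fields contain all requested tokens.

    `constraints` maps a field name to the list of tokens (plain terms or
    tagged terms) that must all be present in that field.
    """
    return [
        entry["id"]
        for entry in index
        if all(field_has(entry, f, toks) for f, toks in constraints.items())
    ]


if __name__ == "__main__":
    # "Which clauses have a female person admiring a city?"
    hits = find_clauses(
        [EXAMPLE_CLAUSE],
        {
            "subject": ["t_entity/person/female"],
            "verb": ["admire"],
            "object": ["t_entity/location/city"],
        },
    )
    print(hits)
```

Because the annotations are stored as tokens alongside the terms, the lookup is the same string-membership test whether the query names a literal term such as "Seattle" or a tag such as t_entity/location/city.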

Table 2 below conceptually illustrates the information that is maintained in an example sentence index of the present invention. Since the terms with syntactic and semantic annotations are stored in the term-clause index, the enhanced indexing information can be identified by the sentence index, but is not typically stored as part of it.

TABLE 2

Field Name    Type       Description
sentid        indexed    Document id and sentence id, separated by ‘_’
sent_text     Stored     String content of the sentence

Table 2 includes an indicator to the entire content of the sentence, and an identifier that will enable cross referencing to the internal clause ids of the clauses that constitute the sentences. The identifier also cross-references to the document that contains the sentence.
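
As a minimal sketch of the sentid convention in Table 2 (a document id and a sentence id joined by an underscore), the following Python composes and splits such an identifier. The function names and the assumption that the sentence id is an integer are hypothetical; splitting at the last underscore is a design choice made here so that a document id that itself contains underscores still round-trips.

```python
def make_sentid(doc_id, sentence_no):
    """Join a document id and a sentence number with '_' (per Table 2)."""
    return f"{doc_id}_{sentence_no}"


def split_sentid(sentid):
    """Recover (doc_id, sentence_no) from a sentid.

    Splits at the LAST underscore, assuming the sentence id is an integer,
    so document ids that contain underscores are handled.
    """
    doc_id, _, sentence_no = sentid.rpartition("_")
    return doc_id, int(sentence_no)


if __name__ == "__main__":
    sentid = make_sentid("doc_42", 7)
    print(sentid)
    print(split_sentid(sentid))
```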

Table 3 below conceptually illustrates the information that is maintained in an example document index of the present invention. Since the terms with syntactic and semantic annotations are stored in the term-clause index, the enhanced indexing information can be identified by the document index, but is not typically stored as part of it.

TABLE 3

Field Name       Type                              Description
doc_id           Indexed, stored                   Document id
dhs_doc_id       stored                            DHS_doc_id (URL in one embodiment)
title            Tokenized, Indexed, stored        Document title
creationDate     Indexed, stored                   Document creation date; format: yyyy.MM.dd-HH:mm:ss
metatag          Tokenized, Indexed, stored        MetatagName#MetatagValue
content          Tokenized, Indexed, Not Stored    String content of the document
document_type    stored                            Document type (HTML, MSWORD)

The document index stores document tag information that is created typically during the data set preprocessing stage as well as the meta-data tags and (an indicator to) the full document content. The type of the document is also maintained.

FIG. 31 is an example conceptual block diagram of sample contents of a document index of a Syntactic Query Engine. The doc_id field 3101 contains a document identifier; the title field 3102 contains a string representing the title; the creationDate field 3103 indicates the date the document was created, if known. The metadata field 3104 includes a series of meta data tags, each with the metadata name followed by its value. The content field 3105 contains an indicator to the string content of the document. The document_type field 3106 is an indicator of the format of the document (such as an HTML file) determined typically during the data set preprocessing stage.
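
The two structured field formats of the document index, the creationDate pattern yyyy.MM.dd-HH:mm:ss and the MetatagName#MetatagValue pairing, can be illustrated with a short Python sketch. The function names and the sample values are hypothetical; only the two formats come from Table 3.

```python
from datetime import datetime

# strptime equivalent of the yyyy.MM.dd-HH:mm:ss pattern given in Table 3.
CREATION_DATE_FORMAT = "%Y.%m.%d-%H:%M:%S"


def parse_creation_date(text):
    """Parse a creationDate string such as '2004.06.03-14:30:00'."""
    return datetime.strptime(text, CREATION_DATE_FORMAT)


def format_creation_date(moment):
    """Render a datetime in the creationDate format."""
    return moment.strftime(CREATION_DATE_FORMAT)


def parse_metatag(text):
    """Split 'MetatagName#MetatagValue' at the first '#'."""
    name, _, value = text.partition("#")
    return name, value


if __name__ == "__main__":
    # Sample values below are invented for illustration.
    created = parse_creation_date("2004.06.03-14:30:00")
    print(created.year, created.month, created.day)
    print(format_creation_date(created))
    print(parse_metatag("author#Jane Doe"))
```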

Although specific embodiments of, and examples for, methods and systems of the present invention are described herein for illustrative purposes, it is not intended that the invention be limited to these embodiments. Equivalent methods, structures, processes, steps, and other modifications within the spirit of the invention fall within the scope of the invention. The various embodiments described above can be combined to provide further embodiments. Also, all of the above U.S. patents and patent publications referred to in this specification, including U.S. patent application Ser. No. 10/007,299, filed on Nov. 8, 2001, entitled “Method and System for Enhanced Data Searching,” and published as U.S. patent Publication No. 2004/0221235; and U.S. patent application Ser. No. 10/371,399, filed on Nov. 8, 2001, entitled “Method and System for Enhanced Data Searching”, and published as U.S. patent Publication No. 2003/0233224; are incorporated herein by reference, in their entirety.

Aspects of the invention can be modified, if necessary, to employ methods, systems and concepts of these various patents, applications and publications to provide yet further embodiments of the invention. In addition, those skilled in the art will understand how to make changes and modifications to the methods and systems described to meet their specific requirements or conditions. For example, the methods and systems described herein can be applied to any type of search tool or indexing of a data set, and not just the SQE described. In addition, the techniques described may be applied to other types of methods and systems where large data sets must be efficiently reviewed. For example, these techniques may be applied to Internet search tools implemented on a PDA, web-enabled cellular phones, or embedded in other devices. Furthermore, the data sets may comprise data in any language or in any combination of languages. In addition, the user interface and API components described may be implemented to effectively support wireless and handheld devices, for example, PDAs, and other similar devices, with limited screen real estate. These and other changes may be made to the invention in light of the above-detailed description. Accordingly, the invention is not limited by the disclosure.

1. A method in a computer system for preparing a corpus of documents for performing electronic searches, each document having at least one sentence, each sentence having a plurality of terms, comprising: for each sentence of each document, parsing the sentence to generate a parse structure having a plurality of syntactic elements that correspond to the terms of the sentence; normalizing a plurality of the syntactic elements of the generated parse structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type; transforming each sentence to an enhanced data structure of terms, wherein the plurality of the tagged terms are treated as additional terms of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
 2. The method of claim 1 wherein the search engine is at least one of a keyword search engine or a Boolean search engine.
 3. The method of claim 1 wherein the search engine performs string matching to determine whether the designated term having the associated tag type is present in the sentence.
 4. The method of claim 1 wherein the enhanced data structure is an index structure that indexes the terms of the sentences along with the tagged terms of the sentence, thereby treating the terms and the tagged terms as searchable keywords.
 5. The method of claim 1 wherein the enhanced data structure is an augmented term-document matrix.
 6. The method of claim 1 wherein the transforming each sentence to the enhanced data structure is performed for each clause of each sentence such that the tagged terms are treated as additional terms of each clause of the sentence and the search engine determines whether the designated syntactic term having the associated tag type is present in each clause.
 7. The method of claim 1 wherein the associated tag type is at least one of an entity tag, an ontology path, a part-of-speech tag, a grammatical role tag, or an action attribute tag.
 8. The method of claim 7 wherein the grammatical role tag is at least one of a subject, object, verb, noun phrase, or a modifier.
 9. The method of claim 8 wherein the verb is a governing verb.
 10. The method of claim 7 wherein the entity tag is part of a hierarchical organization of tags or is a specified ontology path.
 11. The method of claim 1 wherein the normalizing the plurality of the syntactic elements of the generated parse structure comprises: applying linguistic normalization techniques to a plurality of syntactic elements of the generated parse structure to generate a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type.
 12. The method of claim 11 wherein the linguistic normalization techniques include applying at least one of a transformational grammar rule, a morphological simplification rule, a coreference resolution rule, a verbalization rule, or a verb sense rule.
 13. The method of claim 12 wherein the verbalization rule is at least one of a noun verbalization rule, an adjective verbalization rule, or an adverb verbalization rule.
 14. The method of claim 12 wherein the verbalization rule performs verb phrase simplification.
 15. The method of claim 12 wherein the coreference resolution rule is applied to at least one of a noun, a pronoun, a noun phrase, a pronoun phrase, alias, abbreviation, or acronym.
 16. The method of claim 11 wherein the linguistic normalization techniques include applying at least one rule that normalizes a set of synonyms and acronyms to a standard term or phrase.
 17. The method of claim 11 wherein the linguistic normalization techniques comprise identifying and generating tagged terms that include hypernyms and hyponyms.
 18. The method of claim 11 wherein the linguistic normalization techniques comprise identifying and generating tagged terms that include action attributes.
 19. The method of claim 18 wherein the action attributes comprise a verb tense.
 20. The method of claim 18 wherein the action attributes comprise a verb mood or modality indication such as whether the verb indicates a possibility, subjunctive, irrealis, negation, conditional, or causal relationship.
 21. The method of claim 18 wherein the action attributes comprise similar verbs.
 22. The method of claim 18 wherein the action attributes comprise troponyms, verb entailments, or hypernyms.
 23. The method of claim 1, further comprising: receiving a query that specifies a relationship search by designating at least one of a term, a tag type, or a tag value; translating the query to a set of Boolean expressions; executing a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the query; and returning indications to the set of matching sentence clauses.
 24. The method of claim 23 wherein the received query specifies the relationship search by means of a natural language query that is transformed to the designated at least one of a term, a tag type, or a tag value.
 25. The method of claim 23 wherein the received query specifies a relationship search in combination with a document level Boolean search for at least one keyword.
 26. The method of claim 23 wherein the received query specifies a relationship search that is constrained by an expression that indicates a keyword search of the documents for at least one search term.
 27. The method of claim 23 wherein the received query specifies a relationship search that is constrained by a meta-data tag expression.
 28. The method of claim 23 wherein the received query specifies a relationship search that is constrained by an expression that indicates a value of a prepositional phrase.
 29. The method of claim 23 wherein the tag type specifies at least one of a grammatical role, an entity specification, or a path in an ontology.
 30. The method of claim 23 wherein the tag type specifies a subject, an object, or a verb.
 31. The method of claim 23 wherein the tag value specifies a term that corresponds to at least one of a specified grammatical role or a path in an ontology.
 32. The method of claim 23 wherein the relationship search specifies a search term using a wildcard.
 33. The method of claim 32 wherein the wildcard indicates a single character, range of characters, whole word, range of words, or a specific occurrence of a word.
 34. The method of claim 23 wherein the relationship search designates at least one of a subject, an object, or a verb and the search engine determines all clauses in the corpus of documents where a grammatical relationship exists that satisfies the designated at least one subject, object, or verb.
 35. The method of claim 34 wherein the relationship search designates a value of a subject and the search engine determines a corresponding object and a corresponding verb of all clauses that contain a subject having the designated value.
 36. The method of claim 34 wherein the relationship search designates a value of an object and the search engine determines a corresponding subject and a corresponding verb of all clauses that contain an object having the designated value.
 37. The method of claim 34 wherein the relationship search designates a value of a verb and the search engine determines a corresponding subject and a corresponding object of all clauses that contain a verb having the designated value or a similar verb to the designated value.
 38. The method of claim 34 wherein the relationship search designates a wildcard for at least one of the values of the designated at least one subject, object, or verb.
 39. The method of claim 34 wherein the search engine considers the presence in a sentence clause of a term having a modifier grammatical role that relates to the designated subject, object, or verb as a match to the designated subject, object, or verb.
 40. The method of claim 23 wherein the search engine is an off-the-shelf keyword search engine.
 41. The method of claim 1, further comprising: receiving a script that specifies a plurality of queries in a script language, each query specifying a relationship search that designates at least one of a term, a tag type, or a tag value; translating the plurality of queries to a set of Boolean expressions; executing a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the Boolean expressions according to the script.
 42. The method of claim 41 wherein the script comprises at least one of control flow instructions, group constructs, query order, or functions.
 43. The method of claim 1 wherein the terms of the sentence and the additional terms are indexed in a matrix that tracks occurrences of the terms across the corpus of documents.
 44. The method of claim 43 wherein the matrix is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 45. The method of claim 43 wherein the matrix is a plurality of term-clause matrices, each corresponding to a different grammatical role.
 46. The method of claim 45 wherein the plurality of term-clause matrices comprise a subject index, an object index, and a verb index.
 47. The method of claim 43 wherein each term with a designated tag type is associated with a location that corresponds to a particular clause, sentence, and document.
 48. A computer-readable memory medium containing instructions that control a computer processor to index a corpus of documents for electronic searching, each document having at least one sentence, each sentence having a plurality of terms, by: for each sentence of each document, parsing the sentence to generate a parse structure having a plurality of syntactic elements that correspond to the terms of the sentence; normalizing a plurality of the syntactic elements of the generated parse structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type; transforming each sentence to an enhanced data structure of terms, wherein the plurality of the tagged terms are treated as additional terms of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
 49. The memory medium of claim 48 wherein the search engine is at least one of a keyword search engine or a Boolean search engine.
 50. The memory medium of claim 48 wherein the search engine performs string matching to determine whether the designated term having the associated tag type is present in the sentence.
 51. The memory medium of claim 48 wherein the enhanced data structure is an index structure that indexes the terms of the sentences along with the tagged terms of the sentence, thereby treating the terms and the tagged terms as searchable keywords.
 52. The memory medium of claim 48 wherein the enhanced data structure is an augmented term-document matrix.
 53. The memory medium of claim 48 wherein the transforming each sentence to the enhanced data structure is performed for each clause of each sentence such that the tagged terms are treated as additional terms of each clause of the sentence and the search engine determines whether the designated syntactic term having the associated tag type is present in each clause.
 54. The memory medium of claim 48 wherein the associated tag type is at least one of an entity tag, an ontology path, a part-of-speech tag, a grammatical role tag, or an action attribute tag.
 55. The memory medium of claim 54 wherein the grammatical role tag is at least one of a subject, object, verb, noun phrase, or a modifier.
 56. The memory medium of claim 54 wherein the entity tag is part of a hierarchical organization of tags or is a specified ontology path.
 57. The memory medium of claim 48 wherein the normalizing the plurality of the syntactic elements of the generated parse structure comprises: applying linguistic normalization techniques to a plurality of syntactic elements of the generated parse structure to generate a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type.
 58. The memory medium of claim 57 wherein the linguistic normalization techniques include applying at least one of a transformational grammar rule, a morphological simplification rule, a coreference resolution rule, a verbalization rule, or a verb sense rule.
 59. The memory medium of claim 57 wherein the linguistic normalization techniques include applying at least one rule that normalizes a set of synonyms and acronyms to a standard term or phrase.
 60. The memory medium of claim 57 wherein the linguistic normalization techniques comprise identifying and generating tagged terms that include hypernyms and hyponyms.
 61. The memory medium of claim 48, further comprising instructions that control the computer processor by: receiving a query that specifies a relationship search by designating at least one of a term, a tag type, or a tag value; translating the query to a set of Boolean expressions; executing a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the query; and returning indications to the set of matching sentence clauses.
 62. The memory medium of claim 61 wherein the received query specifies the relationship search by means of a natural language query that is transformed to the designated at least one of a term, a tag type, or a tag value.
 63. The memory medium of claim 61 wherein the received query specifies a relationship search in combination with a document level Boolean search for at least one keyword.
 64. The memory medium of claim 61 wherein the received query specifies a relationship search that is constrained by at least one of a meta-data tag expression or an expression that indicates a value of a prepositional phrase.
 65. The memory medium of claim 61 wherein the tag type specifies at least one of a grammatical role or a path in an ontology.
 66. The memory medium of claim 61 wherein the tag value specifies a term that corresponds to at least one of a specified grammatical role, an entity specification, or a path in an ontology.
 67. The memory medium of claim 61 wherein the relationship search specifies a search term using a wildcard.
 68. The memory medium of claim 67 wherein the wildcard indicates a single character, range of characters, whole word, range of words, or a specific occurrence of a word.
 69. The memory medium of claim 61 wherein the relationship search designates at least one of a subject, an object, or a verb and the search engine determines all clauses in the corpus of documents where a grammatical relationship exists that satisfies the designated at least one subject, object, or verb.
 70. The memory medium of claim 69 wherein the relationship search designates a value of a subject and the search engine determines a corresponding object and a corresponding verb of all clauses that contain a subject having the designated value.
 71. The memory medium of claim 69 wherein the relationship search designates a value of an object and the search engine determines a corresponding subject and a corresponding verb of all clauses that contain an object having the designated value.
 72. The memory medium of claim 69 wherein the relationship search designates a value of a verb and the search engine determines a corresponding subject and a corresponding object of all clauses that contain a verb having the designated value or a similar verb to the designated value.
 73. The memory medium of claim 69 wherein the relationship search designates a wildcard for at least one of the values of the designated at least one subject, object, or verb.
 74. The memory medium of claim 69 wherein the search engine considers the presence in a sentence clause of a term having a modifier grammatical role that relates to the designated subject, object, or verb as a match to the designated subject, object, or verb.
 75. The memory medium of claim 61 wherein the search engine is an off-the-shelf keyword search engine.
 76. The memory medium of claim 48, further comprising instructions that control the computer processor by: receiving a script that specifies a plurality of queries in a script language, each query specifying a relationship search that designates at least one of a term, a tag type, or a tag value; translating the plurality of queries to a set of Boolean expressions; executing a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the Boolean expressions according to the script.
 77. The memory medium of claim 76 wherein the script comprises at least one of control flow instructions, group constructs, query order, or functions.
 78. The memory medium of claim 48 wherein the terms of the sentence and the additional terms are indexed in a matrix that tracks occurrences of the terms across the corpus of documents.
 79. The memory medium of claim 78 wherein the matrix is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 80. The memory medium of claim 78 wherein the matrix is a plurality of term-clause matrices, each corresponding to a different grammatical role.
 81. The memory medium of claim 78 wherein each term with a designated tag type is associated with a location that corresponds to a particular clause, sentence, and document.
 82. A computer system that indexes a corpus of documents for electronic searching, each document having at least one sentence, each sentence having a plurality of terms, comprising: a parser that parses each sentence of each document to generate a dependency structure that specifies a plurality of syntactic elements that correspond to the terms of the sentence and their relationship to each other; a post processing module that is structured to normalize the dependency structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type; a sentence transformation module that is structured to transform the plurality of tagged terms to an enhanced data structure that stores and treats each tagged term as an encoded additional term of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
 83. The system of claim 82 wherein the search engine is at least one of a keyword search engine or a Boolean search engine.
 84. The system of claim 82 wherein the search engine performs string matching to determine whether the designated term having the associated tag type is present in the sentence.
 85. The system of claim 82 wherein the enhanced data structure is an index structure that indexes the terms of the sentences along with the tagged terms of the sentence, thereby treating the terms and the tagged terms as searchable keywords.
 86. The system of claim 82 wherein the enhanced data structure is an augmented term-document matrix.
 87. The system of claim 82 wherein the transforming each sentence to the enhanced data structure is performed for each clause of each sentence such that the tagged terms are treated as additional terms of each clause of the sentence and the search engine determines whether the designated syntactic term having the associated tag type is present in each clause.
 88. The system of claim 82 wherein the associated tag type is at least one of an entity tag, an ontology path, a part-of-speech tag, a grammatical role tag, or an action attribute tag.
 89. The system of claim 88 wherein the entity tag is part of a hierarchical organization of tags or is a specified ontology path.
 90. The system of claim 82 wherein the post processing module is structured to normalize the plurality of the syntactic elements of the generated parse structure by applying linguistic normalization techniques to a plurality of syntactic elements of the generated parse structure to generate a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type.
 91. The system of claim 90 wherein the linguistic normalization techniques include applying at least one of a transformational grammar rule, a morphological simplification rule, a coreference resolution rule, a verbalization rule, or a verb sense rule.
 92. The system of claim 90 wherein the linguistic normalization techniques include applying at least one rule that normalizes a set of synonyms and acronyms to a standard term or phrase.
 93. The system of claim 90 wherein the linguistic normalization techniques comprise identifying and generating tagged terms that include hypernyms and hyponyms.
 94. The system of claim 82, further comprising: a query interface module that is structured to receive a query that specifies a relationship search by indicating at least one of a term, a tag type, or a tag value; translate the query to a set of Boolean expressions; execute a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the query; and return indications to the set of matching sentence clauses.
 95. The system of claim 94 wherein the received query specifies the relationship search by means of a natural language query that is transformed to the designated at least one of a term, a tag type, or a tag value.
 96. The system of claim 94 wherein the received query specifies a relationship search in combination with a document level Boolean search for at least one keyword.
 97. The system of claim 94 wherein the received query specifies a relationship search that is constrained by at least one of a meta-data tag expression or an expression that indicates a value of a prepositional phrase.
 98. The system of claim 94 wherein the relationship search specifies a search term using a wildcard.
 99. The system of claim 94 wherein the relationship search designates at least one of a subject, an object, or a verb and the search engine determines all clauses in the corpus of documents where a grammatical relationship exists that satisfies the designated at least one subject, object, or verb.
 100. The system of claim 99 wherein the search engine considers the presence in a sentence clause of a term having a modifier grammatical role that relates to the designated subject, object, or verb as a match to the designated subject, object, or verb.
 101. The system of claim 94 wherein the search engine is an off-the-shelf keyword search engine.
 102. The system of claim 82, further comprising: a query interface module that is structured to receive a script that specifies a plurality of queries in a script language, each query specifying a relationship search that designates at least one of a term, a tag type, or a tag value; translate the plurality of queries to a set of Boolean expressions; execute a search engine that evaluates the Boolean expressions against the enhanced data structures of the sentences to determine a set of sentence clauses that match the Boolean expressions according to the script.
 103. The system of claim 102 wherein the script comprises at least one of control flow instructions, group constructs, query order, or functions.
 104. The system of claim 82 wherein the terms of the sentence and the additional terms are indexed in a matrix that tracks occurrences of the terms across the corpus of documents.
 105. The system of claim 104 wherein the matrix is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 106. The system of claim 104 wherein the matrix is a plurality of term-clause matrices, each corresponding to a different grammatical role.
 107. The system of claim 104 wherein each term with a designated tag type is associated with a location that corresponds to a particular clause, sentence, and document.
 108. A method in a computer system for performing a search of a corpus of documents, each document having at least one sentence, comprising: receiving a search query that designates a desired grammatical relationship between a first entity and at least one of a second entity or an action; transforming the search query into a Boolean expression; determining a set of objects that match the Boolean expression using a keyword-style search of a data structure that indexes terms of the documents including grammatical relationship information; and returning an indication of each matching object in the corpus that encompasses the desired relationship.
 109. The method of claim 108 wherein the determining the set of objects determines objects that are at least one of clauses, sentences, paragraphs, or documents.
 110. The method of claim 108 wherein the data structure stores the grammatical relationship information as additional terms of the documents.
 111. The method of claim 108 wherein the designated at leastone second entity or the action indicates a desire to match any secondentity.
 112. The method of claim 111, each sentence of each documentcomprising at least one clause, wherein the any second entity is anyterm used as a subject of a clause of a sentence.
 113. The method ofclaim 111, each sentence of each document comprising at least oneclause, wherein the any second entity is any term used as an object of aclause of a sentence.
 114. The method of claim 111 wherein the firstentity is any term that matches a specified entity type or ontology pathspecification.
 115. The method of claim 108 wherein the first entity isany term that matches a specified entity type or ontology pathspecification.
 116. The method of claim 108 wherein the designated atleast one second entity or the action indicates a desire to match anyaction.
 117. The method of claim 108 wherein the designated at least onesecond entity or the action is a verb.
 118. The method of claim 117wherein the returning the indication of each matching object thatencompasses the desired relationship returns indications to objects thatcontain similar verbs to the designated verb.
 119. The method of claim117 wherein the returning the indication of each matching object thatencompasses the desired relationship returns indications to objects thatcontain the same verb as the designated verb.
 120. The method of claim 117 wherein the returning the indication of each matching object that encompasses the desired relationship returns indications to objects that contain verbs of a similar classification to the designated verb.
 121. The method of claim 108 wherein the designated at least one second entity or the action indicates a desire to match any action and a desire to match any second entity.
 122. The method of claim 121 wherein the first entity is any term that matches a specified entity type.
 123. The method of claim 108 wherein the receiving the search query that designates the desired grammatical relationship between a first entity and at least one of a second entity or an action specifies at least one of a prepositional constraint, a document keyword constraint, or a document metadata constraint.
 124. The method of claim 108 wherein the search query includes a Boolean operation.
 125. The method of claim 124 wherein the Boolean operation includes an AND, OR, or NOT operation.
 126. The method of claim 108 wherein the search query includes an operator that specifies at least one of a proximity, a range, a wildcard, a weighted search based upon frequency, or a weighted keyword search operation.
 127. The method of claim 108 wherein the search query includes a designation of at least one entity type.
 128. The method of claim 127 wherein the at least one entity type is a path specification in a classification system.
 129. The method of claim 127 wherein the at least one entity type is a path specification in a taxonomy that is specific to the corpus.
 130. The method of claim 108 wherein the search query includes a wildcard specification in the designation of the desired grammatical relationship.
 131. The method of claim 130 wherein the wildcard specification is one of a single character wildcard operator, a multi-character wildcard operator, or a word wildcard operator.
 132. The method of claim 108, wherein the search query designates a desired grammatical relationship between the first entity and the second entity, the search query further designating a link entity specification that is used to link the first entity and the second entity.
 133. The method of claim 132 wherein the link entity specification is an entity type.
 134. The method of claim 132 wherein the link entity specification is a path specification in a classification system.
 135. The method of claim 108 wherein the transforming the search query to generate a Boolean expression incorporates transformational grammar rules to generate related grammatical relationships to search for.
 136. The method of claim 108 wherein the generated Boolean expression includes an expression that causes a search for the desired grammatical relationship using at least one modifier.
 137. The method of claim 136 wherein the at least one modifier is at least one of a subject modifier, an object modifier, a verb modifier, or an argument of preposition.
 138. The method of claim 136 wherein the expression that causes a search for the desired grammatical relationship using the at least one modifier specifies an expression in which the modifier acts as a part of the first entity or the second entity.
 139. The method of claim 136 wherein the expression that causes a search for the desired grammatical relationship using the at least one modifier specifies an expression in which the modifier acts as a part of the action.
 140. The method of claim 108 wherein the data structure is a reverse index of terms that indexes at least one of documents, sentences, or clauses.
 141. The method of claim 140 wherein the indexed terms include tagged terms that each denote a tag type associated with the term.
 142. The method of claim 140 wherein the indexed terms include tagged terms that each denote a grammatical role associated with the term.
 143. The method of claim 142 wherein the associated grammatical roles are at least one of subject, object, verb, or modifier.
 144. The method of claim 140 wherein the indexed terms include tagged terms that each denote a semantic tag associated with the term.
 145. The method of claim 144 wherein the associated semantic tags are path specifications in a classification system.
 146. The method of claim 140 wherein the reverse index of terms comprises a plurality of reverse indices of terms.
 147. The method of claim 108 wherein the data structure is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 148. The method of claim 108 wherein the determining the set of sentences that match the Boolean expression performs pattern matching to determine the desired grammatical relationship.
 149. The method of claim 108, the returning the indication of each matching object in the corpus that encompasses the desired relationship comprising: returning an indication of at least one of each matching clause, each matching sentence, or each matching document in the corpus that encompasses the desired relationship.
 150. The method of claim 108, the returning the indication of each matching object in the corpus that encompasses the desired relationship comprising: in response to receiving a search query that designates a desired grammatical relationship between a first entity and any action, returning an indication of each matching object in the corpus that encompasses the first entity along with an indication of a corresponding action encompassed in the matching object.
 151. The method of claim 108, the data structure that indexes terms of the documents including grammatical relationship information being stored across a plurality of storage repositories, wherein the determining the set of objects that match the Boolean expression using a keyword-style search of the data structure, further comprises: performing a keyword-style search of the data structure against each storage repository that contains the portion of the index; and merging the results of the search to return the indication of each matching object in the corpus that encompasses the desired relationship.
 152. The method of claim 151 wherein the keyword-style searches against each storage repository that contains the portion of the index are performed using parallel processing techniques.
 153. A computer-readable memory medium containing instructions to control a computer processor to search a corpus of documents, each document having at least one sentence, by: receiving a search query that designates a desired grammatical relationship between a first entity and at least one of a second entity or an action; transforming the search query into a Boolean expression; determining a set of objects that match the Boolean expression using a keyword-style search of a data structure that indexes terms of the documents including grammatical relationship information; and returning an indication of each matching object in the corpus that encompasses the desired relationship.
 154. The memory medium of claim 153 wherein the determined objects are at least one of clauses, sentences, paragraphs, or documents.
 155. The memory medium of claim 153 wherein the data structure stores the grammatical relationship information as additional terms of the documents.
 156. The memory medium of claim 153 wherein the designated at least one second entity or the action indicates a desire to match any second entity.
 157. The memory medium of claim 156, each sentence of each document comprising at least one clause, wherein the any second entity is any term used as a subject of a clause of a sentence.
 158. The memory medium of claim 156, each sentence of each document comprising at least one clause, wherein the any second entity is any term used as an object of a clause of a sentence.
 159. The memory medium of claim 156 wherein the first entity is any term that matches a specified entity type or ontology path specification.
 160. The memory medium of claim 153 wherein the first entity is any term that matches a specified entity type or ontology path specification.
 161. The memory medium of claim 153 wherein the designated at least one second entity or the action indicates a desire to match any action.
 162. The memory medium of claim 153 wherein the designated at least one second entity or the action is a verb.
 163. The memory medium of claim 162 wherein the returning the indication of each matching object that encompasses the desired relationship returns indications to objects that contain similar verbs to the designated verb.
 164. The memory medium of claim 162 wherein the returning the indication of each matching object that encompasses the desired relationship returns indications to objects that contain the same verb as the designated verb.
 165. The memory medium of claim 162 wherein the returning the indication of each matching object that encompasses the desired relationship returns indications to objects that contain verbs of a similar classification to the designated verb.
 166. The memory medium of claim 153 wherein the designated at least one second entity or the action indicates a desire to match any action and a desire to match any second entity.
 167. The memory medium of claim 153 wherein the designated desired grammatical relationship specifies at least one of a prepositional constraint, a document keyword constraint, or a document metadata constraint.
 168. The memory medium of claim 153 wherein the search query includes a Boolean operation.
 169. The memory medium of claim 153 wherein the search query includes an operator that specifies at least one of a proximity, a range, a wildcard, a weighted search based upon frequency, or a weighted keyword search operation.
 170. The memory medium of claim 153 wherein the search query includes a designation of at least one entity type or path specification in a classification system.
 171. The memory medium of claim 153 wherein the search query includes a wildcard specification in the designation of the desired grammatical relationship.
 172. The memory medium of claim 153, wherein the search query designates a desired grammatical relationship between the first entity and the second entity, the search query further designating a link entity specification that is used to link the first entity and the second entity.
 173. The memory medium of claim 153 wherein the transforming the search query to generate a Boolean expression incorporates transformational grammar rules to generate related grammatical relationships to search for.
 174. The memory medium of claim 153 wherein the generated Boolean expression includes an expression that causes a search for the desired grammatical relationship using at least one modifier.
 175. The memory medium of claim 153 wherein the data structure is a reverse index of terms that indexes at least one of documents, sentences, or clauses.
 176. The memory medium of claim 175 wherein the indexed terms include tagged terms that each denote a tag type associated with the term.
 177. The memory medium of claim 175 wherein the indexed terms include tagged terms that each denote a grammatical role associated with the term.
 178. The memory medium of claim 175 wherein the indexed terms include tagged terms that each denote a semantic tag associated with the term.
 179. The memory medium of claim 178 wherein the associated semantic tags are path specifications in a classification system.
 180. The memory medium of claim 175 wherein the reverse index of terms comprises a plurality of reverse indices of terms.
 181. The memory medium of claim 153 wherein the data structure is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 182. The memory medium of claim 153 wherein the determining the set of sentences that match the Boolean expression performs pattern matching to determine the desired grammatical relationship.
 183. The memory medium of claim 153, the returning the indication of each matching object in the corpus that encompasses the desired relationship comprising: returning an indication of at least one of each matching clause, each matching sentence, or each matching document in the corpus that encompasses the desired relationship.
 184. The memory medium of claim 153, the returning the indication of each matching object in the corpus that encompasses the desired relationship comprising: in response to receiving a search query that designates a desired grammatical relationship between a first entity and any action, returning an indication of each matching object in the corpus that encompasses the first entity along with an indication of a corresponding action encompassed in the matching object.
 185. The memory medium of claim 153, the data structure that indexes terms of the documents including grammatical relationship information being stored across a plurality of storage repositories, wherein the determining the set of objects that match the Boolean expression using a keyword-style search of the data structure, further comprises: performing a keyword-style search of the data structure against each storage repository that contains the portion of the index; and merging the results of the search to return the indication of each matching object in the corpus that encompasses the desired relationship.
 186. The memory medium of claim 185 wherein the keyword-style searches against each storage repository that contains the portion of the index are performed using parallel processing techniques.
 187. A search engine that searches a corpus of documents, each document having at least one sentence, comprising: a data structure that indexes and stores terms of the documents along with annotations that include relationship information, each annotation associated with at least one term; a keyword search engine that pattern matches an input string against the data structure and returns an indication of each matching object of the corpus; and a query processor that is structured to receive a relationship search query that is indicative of at least one syntactically or semantically annotated term; transform the relationship search query into at least one Boolean expression; and invoke the keyword search engine to determine a set of objects that match the at least one Boolean expression by pattern matching the at least one annotated term indicated by the search query to the data structure, such that each matching object encompasses the relationship specified by the relationship search.
 188. The search engine of claim 187 wherein the returned indications indicate at least one of clauses, sentences, paragraphs, or documents.
 189. The search engine of claim 187 wherein the data structure stores the relationship information as additional terms of the documents.
 190. The search engine of claim 187 wherein the relationship search query specifies a desired grammatical relationship between a first entity and at least one of a second entity or an action.
 191. The search engine of claim 190 wherein the specified at least one second entity or the action indicates a desire to match any second entity.
 192. The search engine of claim 190 wherein the first entity is any term that matches a specified entity type or ontology path specification.
 193. The search engine of claim 190 wherein the specified at least one second entity or the action indicates a desire to match any action.
 194. The search engine of claim 190 wherein the specified at least one second entity or the action is a verb.
 195. The search engine of claim 190 wherein the specified at least one second entity or the action indicates a desire to match any action and a desire to match any second entity.
 196. The search engine of claim 187, the relationship search specifying a desired action, wherein the returned indications of each matching object that encompasses the relationship returns indications to objects that contain similar verbs to a verb indicated by the desired action, the same verb as the verb indicated by the desired action, or a verb of a classification related to the verb indicated by the desired action.
 197. The search engine of claim 187 wherein the relationship search query specifies at least one of a prepositional constraint, a document keyword constraint, or a document metadata constraint.
 198. The search engine of claim 187 wherein the relationship search query includes a Boolean operation.
 199. The search engine of claim 187 wherein the relationship search query includes an operator that specifies at least one of a proximity, a range, a wildcard, a weighted search based upon frequency, or a weighted keyword search operation.
 200. The search engine of claim 187 wherein the relationship search query specifies at least one entity type or path specification in a classification system.
 201. The search engine of claim 187 wherein the relationship search query includes a wildcard specification.
 202. The search engine of claim 187, wherein the relationship search query includes a link entity specification.
 203. The search engine of claim 187 wherein the transformed relationship search query incorporates transformational grammar rules.
 204. The search engine of claim 187 wherein the transformed relationship search query includes an expression that causes a search using at least one modifier.
 205. The search engine of claim 187 wherein the annotations that include relationship information denote a grammatical role of each associated term.
 206. The search engine of claim 187 wherein the annotations denote semantic tags associated with the terms.
 207. The search engine of claim 206 wherein the associated semantic tags are path specifications in a classification system.
 208. The search engine of claim 187 wherein the data structure is a reverse index of terms that indexes at least one of documents, sentences, or clauses.
 209. The search engine of claim 208 wherein the reverse index of terms comprises a plurality of reverse indices of terms.
 210. The search engine of claim 187 wherein the data structure is at least one of a term-document matrix, a term-sentence matrix, or a term-clause matrix.
 211. The search engine of claim 187, wherein the returned indication of each matching object returns an indication of at least one of each matching clause, each matching sentence, or each matching document in the corpus that encompasses the desired relationship.
 212. The search engine of claim 187, the data structure that indexes and stores terms of the documents storing and indexing the terms with the annotations across a plurality of storage repositories, and wherein the keyword search engine performs pattern match searches of the input string against each storage repository that contains the portion of the index and merges the results of the pattern match searches to return the indication of each matching object in the corpus that encompasses the desired relationship.
 213. The search engine of claim 212 wherein the pattern match searches of the input string against each storage repository that contains the portion of the index are performed using parallel processing techniques.
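By way of illustration only, and not as part of any claim, the query processor flow recited in the search engine claims above (a relationship query transformed into a Boolean expression that a keyword-style search then evaluates against an annotated index) might be sketched as follows. The `role:term` tag format, the function names `to_boolean` and `search`, and the toy index contents are hypothetical and are not taken from the specification.

```python
# Illustrative sketch only: tag format and names are assumptions.

def to_boolean(subject=None, verb=None, obj=None):
    """Turn a relationship query into a conjunction of tagged terms.

    A value of None acts as "match any" and contributes no term.
    """
    parts = [("subj", subject), ("verb", verb), ("obj", obj)]
    return [f"{role}:{term}" for role, term in parts if term is not None]


def search(index, conjunction):
    """Keyword-style AND search: intersect the posting sets of all terms."""
    postings = [index.get(term, set()) for term in conjunction]
    return set.intersection(*postings) if postings else set()


# Toy index mapping tagged terms to clause identifiers (hypothetical data).
index = {
    "subj:bush": {"c1"},
    "verb:visit": {"c1", "c3"},
    "obj:china": {"c1", "c2"},
    "subj:china": {"c3"},
}

expr = to_boolean(subject="bush", verb="visit")  # second entity left open
hits = search(index, expr)
```

Because the grammatical roles are stored as ordinary index terms in this sketch, the search step needs nothing beyond the set intersection that a conventional keyword engine already performs.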
 214. A computer-readable memory medium containing structured data that stores a syntactic query, the query executed by a computer processor under the control of a search engine to search a corpus of objects for objects that match the query, comprising: a base component that specifies values for desired relationship parameters; a prepositional constraint component that specifies a desired value for a prepositional phrase; a keyword constraint component that specifies desired keyword values; and a metadata constraint component that specifies desired values of metadata associated with each matching object, whereby, when the search engine causes the search to be executed, objects that match the constraints specified by the base component, the prepositional constraint component, the keyword constraint component, and the metadata constraint component are determined to satisfy the query.
 215. The memory medium of claim 214 wherein one or more of the components of the syntactic query are optional.
 216. The memory medium of claim 214 wherein at least one of the components of the syntactic query is specified.
 217. The memory medium of claim 214 wherein at least one of the components of the syntactic query contains a Boolean expression.
 218. The memory medium of claim 214 wherein the base component specifies the desired relationship parameters in a general syntactic form: Entity1 Directional-operator1 Action Directional-operator2 Entity2 wherein at least one of Entity1, Entity2, and Action parameters contains a non-null value that indicates a search term, the Directional-operator1 parameter specifies the direction of the relationship between the Entity1 and the Action parameters, and the Directional-operator2 parameter specifies the direction of the relationship between the Entity2 and the Action parameters.
 219. The memory medium of claim 218 wherein a value of the Directional-operator parameter is one of a greater-than symbol (“>”), a right arrow symbol (“→”), a less-than symbol (“<”), a left arrow symbol (“←”) or a combination indicating a bi-directional relationship (“< >” or “⇄”).
 220. The memory medium of claim 218 wherein a specification of a value of “>” or “→” for the Directional-operator1 parameter indicates that the value indicated by the Entity1 parameter is a subject of the value indicated by the Action parameter.
 221. The memory medium of claim 218 wherein a specification of a value of “<” or “←” for the Directional-operator1 parameter indicates that the value indicated by the Entity1 parameter is an object of the value indicated by the Action parameter.
 222. The memory medium of claim 218 wherein a specification of a value of “>” or “→” for the Directional-operator2 parameter indicates that the value indicated by the Entity2 parameter is an object of the value indicated by the Action parameter.
 223. The memory medium of claim 218 wherein a specification of a value of “<” or “←” for the Directional-operator2 parameter indicates that the value indicated by the Entity2 parameter is a subject of the value indicated by the Action parameter.
 224. The memory medium of claim 218 wherein a value for the Action parameter indicates a search term that represents at least one of a particular verb, similar verbs, or an action type.
 225. The memory medium of claim 218 wherein a value for the Action parameter that is in the form of a quoted verb indicates a particular verb; a value for the Action parameter that is in the form of an unquoted verb indicates similar verbs to that which is specified; and a value for the Action parameter that is in the form of a bracketed verb indicates an action type.
 226. The memory medium of claim 218 wherein a value for the Entity1 or the Entity2 parameter is a noun or noun phrase.
 227. The memory medium of claim 218 wherein a value for the Entity1 or the Entity2 parameter is a modifier.
 228. The memory medium of claim 214 wherein the prepositional constraint component comprises the phrase “PREP CONTAINS” or the character “^” followed by at least one search term.
 229. The memory medium of claim 214 wherein the keyword constraint component comprises the phrase “DOCUMENT CONTAINS” or the character “;” followed by at least one search term.
 230. The memory medium of claim 214 wherein the metadata constraint component comprises the phrase “METADATA CONTAINS” or the character “#” followed by at least one expression that specifies a desired value for a metadata variable.
 231. The memory medium of claim 214 wherein a wildcard can be specified as the value of a search term or a parameter of the base component.
 232. The memory medium of claim 231 wherein the wildcard is at least one of the characters “*” or “?”.
 233. The memory medium of claim 214 wherein curly braces are used to indicate indirect link searches.
 234. The memory medium of claim 214 wherein square brackets are used to indicate an action type or an entity classification.
 235. A search engine programmed to process syntactic queries that are stored in the structured data contained in the computer readable memory medium and structured according to claim 214.
 236. A method in a computer system comprising storing syntactic queries in the structured data contained in the computer readable memory medium and structured according to claim 214.
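For illustration only, and not as part of any claim, the base component form "Entity1 Directional-operator1 Action Directional-operator2 Entity2" and the quoted, unquoted, and bracketed Action forms recited above might be parsed roughly as sketched below. The sketch assumes ASCII operators only (">", "<", and "<>" written without a space), ignores the arrow symbols and the constraint components, and uses hypothetical names (`BASE`, `classify_action`, `parse_base`) and example queries that do not appear in the specification.

```python
import re

# Assumption: ASCII operators only; "<>" is listed first so it is not
# mistaken for a "<" followed by an empty Action and a ">".
BASE = re.compile(r"^\s*(.*?)\s*(<>|>|<)\s*(.*?)\s*(<>|>|<)\s*(.*?)\s*$")


def classify_action(action):
    """Map the Action text to (kind, value) following the quoted/unquoted/bracketed forms."""
    if not action or action == "*":
        return ("any", None)
    if len(action) >= 2 and action[0] == '"' and action[-1] == '"':
        return ("exact_verb", action[1:-1])       # quoted: a particular verb
    if action[0] == "[" and action[-1] == "]":
        return ("action_type", action[1:-1])      # bracketed: an action type
    return ("similar_verbs", action)              # unquoted: similar verbs


def parse_base(query):
    """Split a base component into entities, their grammatical roles, and the action."""
    match = BASE.match(query)
    if match is None:
        raise ValueError(f"not a base component: {query!r}")
    entity1, op1, action, op2, entity2 = match.groups()
    # ">" before the Action makes Entity1 the subject; "<" makes it the object.
    role1 = {">": "subject", "<": "object", "<>": "either"}[op1]
    # ">" after the Action makes Entity2 the object; "<" makes it the subject.
    role2 = {">": "object", "<": "subject", "<>": "either"}[op2]
    return {
        "entity1": (entity1, role1),
        "action": classify_action(action),
        "entity2": (entity2, role2),
    }


parsed = parse_base('Bush > "visit" > China')
```

A fuller parser would also have to handle the prepositional, keyword, and metadata constraint components and the wildcard and brace notations, which this sketch leaves out.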
 237. A computer-readable memory medium that contains a reverse index for storing a corpus of documents according to terms present in the documents, the index accessed by a computer processor that is controlled by a search engine to match a query against the corpus of documents, the index comprising: a plurality of terms, each term indicating at least one sentence in which the term occurs; and a plurality of tagged terms, each tagged term specifying a syntactic role that is associated with the term in the at least one sentence and each tagged term indicating the at least one sentence in which the associated term occurs; such that the search engine can determine, by pattern matching query terms against the reverse index, a set of sentences that match a relationship indicated by the query.
 238. The memory medium of claim 237 wherein the search engine is a keyword-style search engine.
 239. The memory medium of claim 237 wherein the syntactic role specified by each tagged term is at least one of a subject, object, verb, or modifier.
 240. The memory medium of claim 237 wherein a plurality of the tagged terms specify a semantic tag that is associated with the term in the at least one sentence.
 241. The memory medium of claim 240 wherein the semantic tag is at least one of an entity tag or a path specification in a classification structure.
 242. The memory medium of claim 237 wherein the reverse index is a term-clause index wherein each term indicates a clause within the indicated at least one sentence in which the term occurs.
 243. The memory medium of claim 237 wherein the reverse index is a term-clause index wherein each tagged term indicates a clause within the indicated at least one sentence in which the associated term occurs.
 244. A search engine programmed to process queries against a corpus of documents that are stored in the reverse index contained in the computer readable memory medium and structured according to claim 237.
 245. The search engine of claim 244 wherein keyword searching techniques are performed to process queries.
 246. A method in a computer system comprising storing a corpus of documents in the reverse index contained in the computer readable memory medium and structured according to claim 214.
 247. The method of claim 246 wherein the reverse index is a term-clause index.
 248. The method of claim 246 wherein the reverse index is a term-sentence index.
 249. The method of claim 246 wherein the reverse index is a term-document index.
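As a non-limiting illustration, and not as part of any claim, a reverse index holding both plain terms and tagged terms at the clause level, as recited in the reverse index claims above, might be built along the following lines. The `role:term` tag encoding, the clause identifiers, and the function name `build_index` are hypothetical choices made for this sketch.

```python
from collections import defaultdict


def build_index(clauses):
    """Build a term-clause reverse index.

    clauses: iterable of (clause_id, [(term, role), ...]) pairs, where the
    role is a syntactic role such as "subj", "verb", or "obj".
    """
    index = defaultdict(set)
    for clause_id, annotated_terms in clauses:
        for term, role in annotated_terms:
            index[term].add(clause_id)              # plain term posting
            index[f"{role}:{term}"].add(clause_id)  # tagged term posting
    return index


# Hypothetical parsed clauses; identifiers encode document, sentence, clause.
clauses = [
    ("d1.s1.c1", [("bush", "subj"), ("visit", "verb"), ("china", "obj")]),
    ("d1.s2.c1", [("china", "subj"), ("export", "verb"), ("steel", "obj")]),
]
index = build_index(clauses)

plain_hits = index["china"]       # every clause that mentions the term
object_hits = index["obj:china"]  # only clauses where the term is an object
```

Keeping the plain and tagged postings side by side lets the same index answer an ordinary keyword lookup and a role-constrained lookup with the same operation.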